pandas
pandas - a powerful data analysis and manipulation library for Python
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
Main Features
Here are just a few of the things that pandas does well:
Easy handling of missing data in floating point as well as non-floating point data.
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations.
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.
Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects.
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
Intuitive merging and joining data sets.
Flexible reshaping and pivoting of data sets.
Hierarchical labeling of axes (possible to have multiple labels per tick).
Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format.
Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
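As a small illustrative taste of a few of these features together (a hedged sketch, not part of the original overview; the column and group names are invented for the example):

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, np.nan, 3.0]})
>>> df["value"].isna()  # missing data is first-class
0    False
1     True
2    False
Name: value, dtype: bool
>>> df.groupby("group")["value"].sum()  # split-apply-combine
group
a    1.0
b    3.0
Name: value, dtype: float64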
- class pandas.ArrowDtype[source]
An ExtensionDtype for PyArrow data types.
Warning
ArrowDtype is considered experimental. The implementation and parts of the API may change without warning.
While most dtype arguments can accept the “string” constructor, e.g. "int64[pyarrow]", ArrowDtype is useful if the data type contains parameters like pyarrow.timestamp.
- Parameters:
pyarrow_dtype (pa.DataType) – An instance of a pyarrow.DataType.
- pyarrow_dtype
Examples
>>> import pyarrow as pa
>>> pd.ArrowDtype(pa.int64())
int64[pyarrow]
Types with parameters must be constructed with ArrowDtype.
>>> pd.ArrowDtype(pa.timestamp("s", tz="America/New_York"))
timestamp[s, tz=America/New_York][pyarrow]
>>> pd.ArrowDtype(pa.list_(pa.int64()))
list<item: int64>[pyarrow]
- property type
Returns associated scalar type.
- numpy_dtype
Return an instance of the related numpy dtype
- kind
- itemsize
Return the number of bytes in this dtype
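For illustration, a minimal sketch of these attributes on a pyarrow-backed integer dtype (assumes pyarrow is installed; exact reprs may vary by version):

>>> import pyarrow as pa
>>> dt = pd.ArrowDtype(pa.int64())
>>> dt.numpy_dtype
dtype('int64')
>>> dt.kind
'i'
>>> dt.itemsize
8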
- classmethod construct_array_type()[source]
Return the array type associated with this dtype.
- Return type:
type[ArrowExtensionArray]
- class pandas.BooleanDtype[source]
Extension dtype for boolean data.
Warning
BooleanDtype is considered experimental. The implementation and parts of the API may change without warning.
Examples
>>> pd.BooleanDtype()
BooleanDtype
- property type: type
The scalar type for the array, e.g. int.
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- property kind: str
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- property numpy_dtype: dtype
Return an instance of our numpy dtype
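As a hedged illustration of these properties (reprs may differ slightly across versions):

>>> dt = pd.BooleanDtype()
>>> dt.kind
'b'
>>> dt.numpy_dtype
dtype('bool')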
- class pandas.Categorical[source]
Represent a categorical variable in classic R / S-plus fashion.
Categoricals can only take on a limited, and usually fixed, number of possible values (categories). In contrast to statistical categorical variables, a Categorical might have an order, but numerical operations (additions, divisions, …) are not possible.
All values of the Categorical are either in categories or np.nan. Assigning values outside of categories will raise a ValueError. Order is defined by the order of the categories, not lexical order of the values.
- Parameters:
values (list-like) – The values of the categorical. If categories are given, values not in categories will be replaced with NaN.
categories (Index-like (unique), optional) – The unique categories for this categorical. If not given, the categories are assumed to be the unique values of values (sorted, if possible, otherwise in the order in which they appear).
ordered (bool, default False) – Whether or not this categorical is treated as an ordered categorical. If True, the resulting categorical will be ordered. An ordered categorical respects, when sorted, the order of its categories attribute (which in turn is the categories argument, if provided).
dtype (CategoricalDtype) – An instance of CategoricalDtype to use for this categorical.
fastpath (bool) –
copy (bool) –
- codes
The codes (integer positions, which point to the categories) of this categorical, read only.
- Type:
ndarray
- dtype
The instance of CategoricalDtype storing the categories and ordered.
- Type:
CategoricalDtype
- Raises:
ValueError – If the categories do not validate.
TypeError – If an explicit ordered=True is given but no categories and the values are not sortable.
See also
CategoricalDtype: Type for categorical data.
CategoricalIndex: An Index with an underlying Categorical.
Notes
See the user guide for more.
Examples
>>> pd.Categorical([1, 2, 3, 1, 2, 3])
[1, 2, 3, 1, 2, 3]
Categories (3, int64): [1, 2, 3]

>>> pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
Missing values are not included as a category.
>>> c = pd.Categorical([1, 2, 3, 1, 2, 3, np.nan])
>>> c
[1, 2, 3, 1, 2, 3, NaN]
Categories (3, int64): [1, 2, 3]
However, their presence is indicated in the codes attribute by code -1.
>>> c.codes
array([ 0,  1,  2,  0,  1,  2, -1], dtype=int8)
Ordered Categoricals can be sorted according to the custom order of the categories and can have a min and max value.
>>> c = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
...                    categories=['c', 'b', 'a'])
>>> c
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['c' < 'b' < 'a']
>>> c.min()
'c'
- property dtype: CategoricalDtype
The CategoricalDtype for this instance.
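For example (illustrative; the CategoricalDtype repr varies slightly across pandas versions):

>>> cat = pd.Categorical(['a', 'b'], ordered=True)
>>> cat.dtype
CategoricalDtype(categories=['a', 'b'], ordered=True)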
- classmethod from_codes(codes, categories=None, ordered=None, dtype=None)[source]
Make a Categorical type from codes and categories or dtype.
This constructor is useful if you already have codes and categories/dtype and so do not need the (computation intensive) factorization step, which is usually done on the constructor.
If your data does not follow this convention, please use the normal constructor.
- Parameters:
codes (array-like of int) – An integer array, where each integer points to a category in categories or dtype.categories, or else is -1 for NaN.
categories (index-like, optional) – The categories for the categorical. Items need to be unique. If the categories are not given here, then they must be provided in dtype.
ordered (bool, optional) – Whether or not this categorical is treated as an ordered categorical. If not given here or in dtype, the resulting categorical will be unordered.
dtype (CategoricalDtype or "category", optional) – If CategoricalDtype, cannot be used together with categories or ordered.
- Return type:
Categorical
Examples
>>> dtype = pd.CategoricalDtype(['a', 'b'], ordered=True)
>>> pd.Categorical.from_codes(codes=[0, 1, 0, 1], dtype=dtype)
['a', 'b', 'a', 'b']
Categories (2, object): ['a' < 'b']
- property categories: Index
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items in the new categories must be the same as the number of items in the old categories.
- Raises:
ValueError – If the new categories do not validate as categories or if the number of new categories does not equal the number of old categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
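For example (an illustrative sketch):

>>> cat = pd.Categorical(['a', 'b', 'a'])
>>> cat.categories
Index(['a', 'b'], dtype='object')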
- property codes: ndarray
The category codes of this categorical.
Codes are an array of integers which are the positions of the actual values in the categories array.
There is no setter, use the other categorical methods and the normal item setter to change values in the categorical.
- Returns:
A non-writable view of the codes array.
- Return type:
ndarray[int]
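For example (illustrative; note the -1 code for the missing value):

>>> cat = pd.Categorical(['a', 'b', 'a', np.nan])
>>> cat.codes
array([ 0,  1,  0, -1], dtype=int8)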
- set_ordered(value)[source]
Set the ordered attribute to the boolean value.
- Parameters:
value (bool) – Set whether this categorical is ordered (True) or not (False).
- Return type:
Categorical
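For example (an illustrative sketch):

>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.set_ordered(True)
['a', 'b', 'c']
Categories (3, object): ['a' < 'b' < 'c']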
- as_unordered()[source]
Set the Categorical to be unordered.
- Returns:
Unordered Categorical.
- Return type:
Categorical
- set_categories(new_categories, ordered=None, rename=False)[source]
Set the categories to the specified new_categories.
new_categories can include new categories (which will result in unused categories) or remove old categories (which results in values set to NaN). If rename==True, the categories will simply be renamed (fewer or more items than in the old categories will result in values set to NaN or in unused categories, respectively).
This method can be used to perform more than one action of adding, removing, and reordering simultaneously and is therefore faster than performing the individual steps via the more specialised methods.
On the other hand, this method does not run checks (e.g., whether the old categories are included in the new categories on a reorder), which can result in surprising changes, for example when using special string dtypes, which do not consider an S1 string equal to a single-character Python string.
- Parameters:
new_categories (Index-like) – The categories in new order.
ordered (bool, default False) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.
rename (bool, default False) – Whether or not the new_categories should be considered as a rename of the old categories or as reordered categories.
- Return type:
Categorical with reordered categories.
- Raises:
ValueError – If new_categories does not validate as categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
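For example (an illustrative sketch: 'a' is dropped from the categories and becomes NaN, while 'd' is added as an unused category; with rename=True the old categories are renamed positionally):

>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.set_categories(['c', 'b', 'd'])
[NaN, 'b', 'c']
Categories (3, object): ['c', 'b', 'd']
>>> cat.set_categories(['x', 'y', 'z'], rename=True)
['x', 'y', 'z']
Categories (3, object): ['x', 'y', 'z']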
- rename_categories(new_categories)[source]
Rename categories.
- Parameters:
new_categories (list-like, dict-like or callable) –
New categories which will replace old categories.
list-like: all items must be unique and the number of items in the new categories must match the existing number of categories.
dict-like: specifies a mapping from old categories to new. Categories not contained in the mapping are passed through and extra categories in the mapping are ignored.
callable : a callable that is called on all items in the old categories and whose return values comprise the new categories.
- Returns:
Categorical with renamed categories.
- Return type:
Categorical
- Raises:
ValueError – If new categories are list-like and do not have the same number of items as the current categories, or do not validate as categories
See also
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['a', 'a', 'b'])
>>> c.rename_categories([0, 1])
[0, 0, 1]
Categories (2, int64): [0, 1]
For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed through

>>> c.rename_categories({'a': 'A', 'c': 'C'})
['A', 'A', 'b']
Categories (2, object): ['A', 'b']
You may also provide a callable to create the new categories
>>> c.rename_categories(lambda x: x.upper())
['A', 'A', 'B']
Categories (2, object): ['A', 'B']
- reorder_categories(new_categories, ordered=None)[source]
Reorder categories as specified in new_categories.
new_categories need to include all old categories and no new category items.
- Parameters:
new_categories (Index-like) – The categories in new order.
ordered (bool, optional) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.
- Returns:
Categorical with reordered categories.
- Return type:
Categorical
- Raises:
ValueError – If the new categories do not contain all old category items, or if they contain any items not present in the old categories
See also
rename_categories: Rename categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
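For example (an illustrative sketch):

>>> cat = pd.Categorical(['a', 'b', 'a'])
>>> cat.reorder_categories(['b', 'a'], ordered=True)
['a', 'b', 'a']
Categories (2, object): ['b' < 'a']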
- add_categories(new_categories)[source]
Add new categories.
new_categories will be included at the last/highest place in the categories and will be unused directly after this call.
- Parameters:
new_categories (category or list-like of category) – The new categories to be included.
- Returns:
Categorical with new categories added.
- Return type:
Categorical
- Raises:
ValueError – If the new categories include old categories or do not validate as categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['c', 'b', 'c'])
>>> c
['c', 'b', 'c']
Categories (2, object): ['b', 'c']

>>> c.add_categories(['d', 'a'])
['c', 'b', 'c']
Categories (4, object): ['b', 'c', 'd', 'a']
- remove_categories(removals)[source]
Remove the specified categories.
removals must be included in the old categories. Values which were in the removed categories will be set to NaN.
- Parameters:
removals (category or list of categories) – The categories which should be removed.
- Returns:
Categorical with removed categories.
- Return type:
Categorical
- Raises:
ValueError – If the removals are not contained in the categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> c.remove_categories(['d', 'a'])
[NaN, 'c', 'b', 'c', NaN]
Categories (2, object): ['b', 'c']
- remove_unused_categories()[source]
Remove categories which are not used.
- Returns:
Categorical with unused categories dropped.
- Return type:
Categorical
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> c[2] = 'a'
>>> c[4] = 'c'
>>> c
['a', 'c', 'a', 'c', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> c.remove_unused_categories()
['a', 'c', 'a', 'c', 'c']
Categories (2, object): ['a', 'c']
- map(mapper)[source]
Map categories using an input mapping or function.
Maps the categories to new categories. If the mapping correspondence is one-to-one the result is a Categorical which has the same order property as the original, otherwise an Index is returned. NaN values are unaffected.
If a dict or Series is used, any unmapped category is mapped to NaN. Note that if this happens an Index will be returned.
- Parameters:
mapper (function, dict, or Series) – Mapping correspondence.
- Returns:
Mapped categorical.
- Return type:
Categorical or Index
See also
CategoricalIndex.map: Apply a mapping correspondence on a CategoricalIndex.
Index.map: Apply a mapping correspondence on an Index.
Series.map: Apply a mapping correspondence on a Series.
Series.apply: Apply more complex functions on a Series.
Examples
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat
['a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> cat.map(lambda x: x.upper())
['A', 'B', 'C']
Categories (3, object): ['A', 'B', 'C']
>>> cat.map({'a': 'first', 'b': 'second', 'c': 'third'})
['first', 'second', 'third']
Categories (3, object): ['first', 'second', 'third']
If the mapping is one-to-one the ordering of the categories is preserved:
>>> cat = pd.Categorical(['a', 'b', 'c'], ordered=True)
>>> cat
['a', 'b', 'c']
Categories (3, object): ['a' < 'b' < 'c']
>>> cat.map({'a': 3, 'b': 2, 'c': 1})
[3, 2, 1]
Categories (3, int64): [3 < 2 < 1]
If the mapping is not one-to-one an Index is returned:

>>> cat.map({'a': 'first', 'b': 'second', 'c': 'first'})
Index(['first', 'second', 'first'], dtype='object')

If a dict is used, all unmapped categories are mapped to NaN and the result is an Index:

>>> cat.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')
- memory_usage(deep=False)[source]
Memory usage of my values
- Parameters:
deep (bool) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption
- Return type:
bytes used
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False
See also
numpy.ndarray.nbytes
- isna()[source]
Detect missing values
Missing values (-1 in .codes) are detected.
- Return type:
np.ndarray[bool] of whether my values are null
See also
isna: Top-level isna.
isnull: Alias of isna.
Categorical.notna: Boolean inverse of Categorical.isna.
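For example (an illustrative sketch):

>>> cat = pd.Categorical([1, 2, np.nan])
>>> cat.isna()
array([False, False,  True])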
- isnull()
Detect missing values
Missing values (-1 in .codes) are detected.
- Return type:
np.ndarray[bool] of whether my values are null
See also
isna: Top-level isna.
isnull: Alias of isna.
Categorical.notna: Boolean inverse of Categorical.isna.
- notna()[source]
Inverse of isna
Both missing values (-1 in .codes) and NA as a category are detected as null.
- Return type:
np.ndarray[bool] of whether my values are not null
See also
notna: Top-level notna.
notnull: Alias of notna.
Categorical.isna: Boolean inverse of Categorical.notna.
- notnull()
Inverse of isna
Both missing values (-1 in .codes) and NA as a category are detected as null.
- Return type:
np.ndarray[bool] of whether my values are not null
See also
notna: Top-level notna.
notnull: Alias of notna.
Categorical.isna: Boolean inverse of Categorical.notna.
- value_counts(dropna=True)[source]
Return a Series containing counts of each category.
Every category will have an entry, even those with a count of 0.
- Parameters:
dropna (bool, default True) – Don’t include counts of NaN.
- Returns:
counts
- Return type:
Series
See also
Series.value_counts
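For example (an illustrative sketch; the name of the returned Series may differ across pandas versions):

>>> cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
>>> cat.value_counts()
a    2
b    1
c    0
Name: count, dtype: int64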
- argsort(*, ascending=True, kind='quicksort', **kwargs)[source]
Return the indices that would sort the Categorical.
Missing values are sorted at the end.
- Parameters:
ascending (bool, default True) – Whether the indices should result in an ascending or descending sort.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, optional) – Sorting algorithm.
**kwargs – passed through to numpy.argsort().
- Return type:
np.ndarray[np.intp]
See also
numpy.ndarray.argsort
Notes
While an ordering is applied to the category values, arg-sorting in this context refers more to organizing and grouping together based on matching category values. Thus, this function can be called on an unordered Categorical instance unlike the functions ‘Categorical.min’ and ‘Categorical.max’.
Examples
>>> pd.Categorical(['b', 'b', 'a', 'c']).argsort()
array([2, 0, 1, 3])

>>> cat = pd.Categorical(['b', 'b', 'a', 'c'],
...                      categories=['c', 'b', 'a'],
...                      ordered=True)
>>> cat.argsort()
array([3, 0, 1, 2])
Missing values are placed at the end
>>> cat = pd.Categorical([2, None, 1])
>>> cat.argsort()
array([2, 0, 1])
- sort_values(*, inplace: Literal[False] = False, ascending: bool = True, na_position: str = 'last') Categorical[source]
- sort_values(*, inplace: Literal[True], ascending: bool = True, na_position: str = 'last') None
Sort the Categorical by category value returning a new Categorical by default.
While an ordering is applied to the category values, sorting in this context refers more to organizing and grouping together based on matching category values. Thus, this function can be called on an unordered Categorical instance unlike the functions ‘Categorical.min’ and ‘Categorical.max’.
- Parameters:
inplace (bool, default False) – Do operation in place.
ascending (bool, default True) – Order ascending. Passing False orders descending. The ordering parameter provides the method by which the category values are organized.
na_position ({‘first’, ‘last’} (optional, default=’last’)) – ‘first’ puts NaNs at the beginning; ‘last’ puts NaNs at the end.
- Return type:
Categorical or None
See also
Categorical.sort, Series.sort_values
Examples
>>> c = pd.Categorical([1, 2, 2, 1, 5])
>>> c
[1, 2, 2, 1, 5]
Categories (3, int64): [1, 2, 5]
>>> c.sort_values()
[1, 1, 2, 2, 5]
Categories (3, int64): [1, 2, 5]
>>> c.sort_values(ascending=False)
[5, 2, 2, 1, 1]
Categories (3, int64): [1, 2, 5]
>>> c = pd.Categorical([1, 2, 2, 1, 5])
‘sort_values’ behaviour with NaNs. Note that ‘na_position’ is independent of the ‘ascending’ parameter:
>>> c = pd.Categorical([np.nan, 2, 2, np.nan, 5])
>>> c
[NaN, 2, 2, NaN, 5]
Categories (2, int64): [2, 5]
>>> c.sort_values()
[2, 2, 5, NaN, NaN]
Categories (2, int64): [2, 5]
>>> c.sort_values(ascending=False)
[5, 2, 2, NaN, NaN]
Categories (2, int64): [2, 5]
>>> c.sort_values(na_position='first')
[NaN, NaN, 2, 2, 5]
Categories (2, int64): [2, 5]
>>> c.sort_values(ascending=False, na_position='first')
[NaN, NaN, 5, 2, 2]
Categories (2, int64): [2, 5]
- min(*, skipna=True, **kwargs)[source]
The minimum value of the object.
Only ordered Categoricals have a minimum!
- max(*, skipna=True, **kwargs)[source]
The maximum value of the object.
Only ordered Categoricals have a maximum!
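For example (an illustrative sketch; calling min or max on an unordered Categorical raises a TypeError):

>>> cat = pd.Categorical(['a', 'b', 'c'], ordered=True)
>>> cat.min()
'a'
>>> cat.max()
'c'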
- unique()[source]
Return the Categorical whose categories and codes are unique.
Changed in version 1.3.0: Previously, unused categories were dropped from the new categories.
- Return type:
Categorical
Examples
>>> pd.Categorical(list("baabc")).unique()
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.Categorical(list("baab"), categories=list("abc"), ordered=True).unique()
['b', 'a']
Categories (3, object): ['a' < 'b' < 'c']
- equals(other)[source]
Returns True if categorical arrays are equal.
- Parameters:
other (Categorical) –
- Return type:
bool
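For example (an illustrative sketch; arrays with differently ordered categories are not equal):

>>> c1 = pd.Categorical(['a', 'b'])
>>> c1.equals(pd.Categorical(['a', 'b']))
True
>>> c1.equals(pd.Categorical(['a', 'b'], categories=['b', 'a']))
False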
- describe()[source]
Describes this Categorical
- Returns:
description – A dataframe with frequency and counts by category.
- Return type:
DataFrame
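For example (an illustrative sketch; column alignment is approximate):

>>> cat = pd.Categorical(['a', 'b', 'a'])
>>> cat.describe()
            counts     freqs
categories
a                2  0.666667
b                1  0.333333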
- isin(values)[source]
Check whether values are contained in Categorical.
Return a boolean NumPy array showing whether each element in the Categorical matches an element in the passed sequence of values exactly.
- Parameters:
values (set or list-like) – The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.
- Return type:
np.ndarray[bool]
- Raises:
TypeError – If values is not a set or list-like
See also
pandas.Series.isin: Equivalent method on Series.
Examples
>>> s = pd.Categorical(['lama', 'cow', 'lama', 'beetle', 'lama',
...                     'hippo'])
>>> s.isin(['cow', 'lama'])
array([ True,  True,  True, False,  True, False])
Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:

>>> s.isin(['lama'])
array([ True, False,  True, False,  True, False])
- class pandas.CategoricalDtype[source]
Type for categorical data with the categories and orderedness.
- Parameters:
categories (sequence, optional) – Must be unique, and must not contain any nulls. The categories are stored in an Index, and if an index is provided the dtype of that index will be used.
ordered (bool or None, default False) – Whether or not this categorical is treated as an ordered categorical. None can be used to maintain the ordered value of existing categoricals when used in operations that combine categoricals, e.g. astype, and will resolve to False if there is no existing ordered to maintain.
- categories
- ordered
See also
Categorical: Represent a categorical variable in classic R / S-plus fashion.
Notes
This class is useful for specifying the type of a Categorical independent of the values. See the user guide for more.
Examples
>>> t = pd.CategoricalDtype(categories=['b', 'a'], ordered=True)
>>> pd.Series(['a', 'b', 'a', 'c'], dtype=t)
0      a
1      b
2      a
3    NaN
dtype: category
Categories (2, object): ['b' < 'a']
An empty CategoricalDtype with a specific dtype can be created by providing an empty index. As follows,
>>> pd.CategoricalDtype(pd.DatetimeIndex([])).categories.dtype
dtype('<M8[ns]')
- name = 'category'
- type
alias of CategoricalDtypeType
- classmethod construct_from_string(string)[source]
Construct a CategoricalDtype from a string.
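For example (an illustrative sketch; the repr varies slightly across pandas versions):

>>> pd.CategoricalDtype.construct_from_string('category')
CategoricalDtype(categories=None, ordered=False)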
- classmethod construct_array_type()[source]
Return the array type associated with this dtype.
- Return type:
type[Categorical]
- static validate_ordered(ordered)[source]
Validates that we have a valid ordered parameter. If it is not a boolean, a TypeError will be raised.
- static validate_categories(categories, fastpath=False)[source]
Validates that we have good categories
- update_dtype(dtype)[source]
Returns a CategoricalDtype with categories and ordered taken from dtype if specified, otherwise falling back to self if unspecified
- Parameters:
dtype (CategoricalDtype) –
- Returns:
new_dtype
- Return type:
CategoricalDtype
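For example (an illustrative sketch: passing ordered=None in the incoming dtype falls back to the ordered value of self; the repr varies slightly across versions):

>>> base = pd.CategoricalDtype(['a', 'b'], ordered=True)
>>> base.update_dtype(pd.CategoricalDtype(['b', 'a'], ordered=None))
CategoricalDtype(categories=['b', 'a'], ordered=True)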
- class pandas.CategoricalIndex[source]
Index based on an underlying Categorical.
CategoricalIndex, like Categorical, can only take on a limited, and usually fixed, number of possible values (categories). Also, like Categorical, it might have an order, but numerical operations (additions, divisions, …) are not possible.
- Parameters:
data (array-like (1-dimensional)) – The values of the categorical. If categories are given, values not in categories will be replaced with NaN.
categories (index-like, optional) – The categories for the categorical. Items need to be unique. If the categories are not given here (and also not in dtype), they will be inferred from the data.
ordered (bool, optional) – Whether or not this categorical is treated as an ordered categorical. If not given here or in dtype, the resulting categorical will be unordered.
dtype (CategoricalDtype or "category", optional) – If CategoricalDtype, cannot be used together with categories or ordered.
copy (bool, default False) – Make a copy of input ndarray.
name (object, optional) – Name to be stored in the index.
- Return type:
CategoricalIndex
- codes
- Type:
np.ndarray
- rename_categories()
- reorder_categories()
- add_categories()
- remove_categories()
- remove_unused_categories()
- set_categories()
- as_ordered()
- as_unordered()
- Raises:
ValueError – If the categories do not validate.
TypeError – If an explicit ordered=True is given but no categories and the values are not sortable.
See also
Index: The base pandas Index type.
Categorical: A categorical array.
CategoricalDtype: Type for categorical data.
Notes
See the user guide for more.
Examples
>>> pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"])
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
CategoricalIndex can also be instantiated from a Categorical:

>>> c = pd.Categorical(["a", "b", "c", "a", "b", "c"])
>>> pd.CategoricalIndex(c)
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
Ordered CategoricalIndex can have a min and max value.

>>> ci = pd.CategoricalIndex(
...     ["a", "b", "c", "a", "b", "c"], ordered=True, categories=["c", "b", "a"]
... )
>>> ci
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'], categories=['c', 'b', 'a'], ordered=True, dtype='category')
>>> ci.min()
'c'
- property codes
The category codes of this categorical.
Codes are an array of integers which are the positions of the actual values in the categories array.
There is no setter, use the other categorical methods and the normal item setter to change values in the categorical.
- Returns:
A non-writable view of the codes array.
- Return type:
ndarray[int]
- property categories
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items in the new categories must be the same as the number of items in the old categories.
- Raises:
ValueError – If the new categories do not validate as categories or if the number of new categories does not equal the number of old categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
- property ordered
Whether the categories have an ordered relationship.
- reindex(target, method=None, level=None, limit=None, tolerance=None)[source]
Create index with target’s values (move/add/delete values as necessary)
- map(mapper)[source]
Map values using an input mapping or function.
Maps the values (their categories, not the codes) of the index to new categories. If the mapping correspondence is one-to-one the result is a CategoricalIndex which has the same order property as the original, otherwise an Index is returned.
If a dict or Series is used, any unmapped category is mapped to NaN. Note that if this happens an Index will be returned.
- Parameters:
mapper (function, dict, or Series) – Mapping correspondence.
- Returns:
Mapped index.
- Return type:
CategoricalIndex or Index
See also
Index.map: Apply a mapping correspondence on an Index.
Series.map: Apply a mapping correspondence on a Series.
Series.apply: Apply more complex functions on a Series.
Examples
>>> idx = pd.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, dtype='category')
>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
If the mapping is one-to-one the ordering of the categories is preserved:
>>> idx = pd.CategoricalIndex(['a', 'b', 'c'], ordered=True)
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=True, dtype='category')
>>> idx.map({'a': 3, 'b': 2, 'c': 1})
CategoricalIndex([3, 2, 1], categories=[3, 2, 1], ordered=True, dtype='category')
If the mapping is not one-to-one an Index is returned:

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'first'})
Index(['first', 'second', 'first'], dtype='object')

If a dict is used, all unmapped categories are mapped to NaN and the result is an Index:

>>> idx.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')
- add_categories(*args, **kwargs)
Add new categories.
new_categories will be included at the last/highest place in the categories and will be unused directly after this call.
- Parameters:
new_categories (category or list-like of category) – The new categories to be included.
- Returns:
Categorical with new categories added.
- Return type:
Categorical
- Raises:
ValueError – If the new categories include old categories or do not validate as categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['c', 'b', 'c'])
>>> c
['c', 'b', 'c']
Categories (2, object): ['b', 'c']

>>> c.add_categories(['d', 'a'])
['c', 'b', 'c']
Categories (4, object): ['b', 'c', 'd', 'a']
- argsort(*args, **kwargs)
Return the indices that would sort the Categorical.
Missing values are sorted at the end.
- Parameters:
ascending (bool, default True) – Whether the indices should result in an ascending or descending sort.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, optional) – Sorting algorithm.
**kwargs – passed through to numpy.argsort().
- Return type:
np.ndarray[np.intp]
See also
numpy.ndarray.argsortNotes
While an ordering is applied to the category values, arg-sorting in this context refers more to organizing and grouping together based on matching category values. Thus, this function can be called on an unordered Categorical instance unlike the functions ‘Categorical.min’ and ‘Categorical.max’.
Examples
>>> pd.Categorical(['b', 'b', 'a', 'c']).argsort()
array([2, 0, 1, 3])

>>> cat = pd.Categorical(['b', 'b', 'a', 'c'],
...                      categories=['c', 'b', 'a'],
...                      ordered=True)
>>> cat.argsort()
array([3, 0, 1, 2])
Missing values are placed at the end
>>> cat = pd.Categorical([2, None, 1])
>>> cat.argsort()
array([2, 0, 1])
- as_ordered(*args, **kwargs)
Set the Categorical to be ordered.
- Returns:
Ordered Categorical.
- Return type:
Categorical
- as_unordered(*args, **kwargs)
Set the Categorical to be unordered.
- Returns:
Unordered Categorical.
- Return type:
Categorical
- max(*args, **kwargs)
The maximum value of the object.
Only ordered Categoricals have a maximum!
- Raises:
TypeError – If the Categorical is not ordered.
- Returns:
max
- Return type:
the maximum of this Categorical, NA if array is empty
- min(*args, **kwargs)
The minimum value of the object.
Only ordered Categoricals have a minimum!
- Raises:
TypeError – If the Categorical is not ordered.
- Returns:
min
- Return type:
the minimum of this Categorical, NA value if empty
- remove_categories(*args, **kwargs)
Remove the specified categories.
removals must be included in the old categories. Values which were in the removed categories will be set to NaN.
- Parameters:
removals (category or list of categories) – The categories which should be removed.
- Returns:
Categorical with removed categories.
- Return type:
Categorical
- Raises:
ValueError – If the removals are not contained in the categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> c.remove_categories(['d', 'a'])
[NaN, 'c', 'b', 'c', NaN]
Categories (2, object): ['b', 'c']
- remove_unused_categories(*args, **kwargs)
Remove categories which are not used.
- Returns:
Categorical with unused categories dropped.
- Return type:
Categorical
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> c[2] = 'a'
>>> c[4] = 'c'
>>> c
['a', 'c', 'a', 'c', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> c.remove_unused_categories()
['a', 'c', 'a', 'c', 'c']
Categories (2, object): ['a', 'c']
- rename_categories(*args, **kwargs)
Rename categories.
- Parameters:
new_categories (list-like, dict-like or callable) –
New categories which will replace old categories.
list-like: all items must be unique and the number of items in the new categories must match the existing number of categories.
dict-like: specifies a mapping from old categories to new. Categories not contained in the mapping are passed through and extra categories in the mapping are ignored.
callable : a callable that is called on all items in the old categories and whose return values comprise the new categories.
- Returns:
Categorical with renamed categories.
- Return type:
Categorical
- Raises:
ValueError – If new categories are list-like and do not have the same number of items as the current categories, or do not validate as categories
See also
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
Examples
>>> c = pd.Categorical(['a', 'a', 'b'])
>>> c.rename_categories([0, 1])
[0, 0, 1]
Categories (2, int64): [0, 1]
For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed through

>>> c.rename_categories({'a': 'A', 'c': 'C'})
['A', 'A', 'b']
Categories (2, object): ['A', 'b']
You may also provide a callable to create the new categories
>>> c.rename_categories(lambda x: x.upper())
['A', 'A', 'B']
Categories (2, object): ['A', 'B']
- reorder_categories(*args, **kwargs)
Reorder categories as specified in new_categories.
new_categories need to include all old categories and no new category items.
- Parameters:
new_categories (Index-like) – The categories in new order.
ordered (bool, optional) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.
- Returns:
Categorical with reordered categories.
- Return type:
Categorical
- Raises:
ValueError – If the new categories do not contain all old category items, or if they contain any items not present in the old categories
See also
rename_categories: Rename categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
- searchsorted(*args, **kwargs)
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted array self (a) such that, if the corresponding elements in value were inserted before the indices, the order of self would be preserved.
Assuming that self is sorted:
side: returned index i satisfies
left: self[i-1] < value <= self[i]
right: self[i-1] <= value < self[i]
- Parameters:
value (array-like, list or scalar) – Value(s) to insert into self.
side ({'left', 'right'}, optional) – If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of self).
sorter (1-D array-like, optional) – Optional array of integer indices that sort array a into ascending order. They are typically the result of argsort.
- Returns:
If value is array-like, array of insertion points. If value is scalar, a single integer.
- Return type:
array of ints or int
See also
numpy.searchsorted: Similar method from NumPy.
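For example (an illustrative sketch on an ordered CategoricalIndex; the values searched for must be present in the categories):

>>> ci = pd.CategoricalIndex(['a', 'b', 'c'], ordered=True)
>>> ci.searchsorted('b')
1
>>> ci.searchsorted(['a', 'c'], side='right')
array([1, 3])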
- set_categories(*args, **kwargs)
Set the categories to the specified new_categories.
new_categories can include new categories (which will result in unused categories) or remove old categories (which results in values set to NaN). If rename==True, the categories will simply be renamed (fewer or more items than in the old categories will result in values set to NaN or in unused categories, respectively).
This method can be used to perform more than one action of adding, removing, and reordering simultaneously and is therefore faster than performing the individual steps via the more specialised methods.
On the other hand, this method does not run checks (e.g., whether the old categories are included in the new categories on a reorder), which can result in surprising changes, for example when using special string dtypes, which do not consider an S1 string equal to a single-character Python string.
- Parameters:
new_categories (Index-like) – The categories in new order.
ordered (bool, default False) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.
rename (bool, default False) – Whether or not the new_categories should be considered as a rename of the old categories or as reordered categories.
- Return type:
Categorical with reordered categories.
- Raises:
ValueError – If new_categories does not validate as categories
See also
rename_categories: Rename categories.
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
- class pandas.DataFrame[source]
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
- Parameters:
data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs.
If data is a list of dicts, column order follows insertion-order.
index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns (Index or array-like) – Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer.
copy (bool or None, default None) –
Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied.
Changed in version 1.3.0.
See also
DataFrame.from_records: Constructor from tuples, also record arrays.
DataFrame.from_dict: From dicts of Series, arrays, or dicts.
read_csv: Read a comma-separated values (csv) file into DataFrame.
read_table: Read general delimited file into DataFrame.
read_clipboard: Read text from clipboard into DataFrame.
Notes
Please reference the User Guide for more information.
Examples
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1    int64
col2    int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object
Constructing DataFrame from a dictionary including Series:
>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Constructing DataFrame from a numpy ndarray that has labeled columns:
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
>>> df3
   c  a
0  3  1
1  6  4
2  9  7
Constructing DataFrame from dataclass:
>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3
Constructing DataFrame from Series/DataFrame:
>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
>>> df = pd.DataFrame(data=ser, index=["a", "c"])
>>> df
   0
a  1
c  3
>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])
>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
   x
a  1
c  3
- property axes: list[pandas.core.indexes.base.Index]
Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.axes
[RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'], dtype='object')]
- property shape: tuple[int, int]
Return a tuple representing the dimensionality of the DataFrame.
See also
ndarray.shape: Tuple of array dimensions.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.shape
(2, 2)

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
...                    'col3': [5, 6]})
>>> df.shape
(2, 3)
- to_string(buf: None = None, columns: Sequence[str] | None = None, col_space: int | list[int] | dict[Hashable, int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: fmt.FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool = False, decimal: str = '.', line_width: int | None = None, min_rows: int | None = None, max_colwidth: int | None = None, encoding: str | None = None) str[source]
- to_string(buf: FilePath | WriteBuffer[str], columns: Sequence[str] | None = None, col_space: int | list[int] | dict[Hashable, int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: fmt.FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool = False, decimal: str = '.', line_width: int | None = None, min_rows: int | None = None, max_colwidth: int | None = None, encoding: str | None = None) None
Render a DataFrame to a console-friendly tabular output.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
col_space (int, list or dict of int, optional) – The minimum width of each column. If a list of ints is given, every integer corresponds to one column. If a dict is given, the key references the column, while the value defines the space to use.
header (bool or sequence of str, optional) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of NaN to use.
formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
float_format (one-parameter function, optional, default None) –
Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.
Changed in version 1.2.0.
sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) –
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_width (int, optional) – Width to wrap a line in characters.
min_rows (int, optional) – The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).
max_colwidth (int, optional) – Max width to truncate each column in characters. By default, no limit.
encoding (str, default "utf-8") – Set character encoding.
- Returns:
If buf is None, returns the result as a string. Otherwise returns None.
- Return type:
str or None
See also
to_html: Convert DataFrame to HTML.
Examples
>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = pd.DataFrame(d)
>>> print(df.to_string())
   col1  col2
0     1     4
1     2     5
2     3     6
- property style: Styler
Returns a Styler object.
Contains methods for building a styled HTML representation of the DataFrame.
See also
io.formats.style.Styler: Helps style a DataFrame or Series according to the data with HTML and CSS.
- items()[source]
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
- Yields:
label (object) – The column names for the DataFrame being iterated over.
content (Series) – The column entries belonging to each label, as a Series.
See also
DataFrame.iterrows: Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples: Iterate over DataFrame rows as namedtuples of the values.
Examples
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                    'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df
        species  population
panda      bear        1864
polar      bear       22000
koala  marsupial       80000
>>> for label, content in df.items():
...     print(f'label: {label}')
...     print(f'content: {content}', sep='\n')
...
label: species
content: panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content: panda     1864
polar    22000
koala    80000
Name: population, dtype: int64
- iterrows()[source]
Iterate over DataFrame rows as (index, Series) pairs.
- Yields:
index (label or tuple of label) – The index of the row. A tuple for a MultiIndex.
data (Series) – The data of the row as a Series.
See also
DataFrame.itertuples: Iterate over DataFrame rows as namedtuples of the values.
DataFrame.items: Iterate over (column name, Series) pairs.
Notes
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
- itertuples(index=True, name='Pandas')[source]
Iterate over DataFrame rows as namedtuples.
- Parameters:
index (bool, default True) – If True, return the index as the first element of the tuple.
name (str or None, default "Pandas") – The name of the returned namedtuples or None to return regular tuples.
- Returns:
An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.
- Return type:
iterator
See also
DataFrame.iterrows: Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.items: Iterate over (column name, Series) pairs.
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore.
Examples
>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)
By setting the index parameter to False we can remove the index as the first element of the tuple:
>>> for row in df.itertuples(index=False):
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)
With the name parameter set we set a custom name for the yielded namedtuples:
>>> for row in df.itertuples(name='Animal'):
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)
- dot(other: Series) Series[source]
- dot(other: DataFrame | Index | ExtensionArray | ndarray) DataFrame
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.
It can also be called using self @ other in Python >= 3.5.
- Parameters:
other (Series, DataFrame or array-like) – The other object to compute the matrix product with.
- Returns:
If other is a Series, return the matrix product between self and other as a Series. If other is a DataFrame or a numpy.array, return the matrix product of self and other as a DataFrame.
- Return type:
Series or DataFrame
See also
Series.dot: Similar method for Series.
Notes
The dimensions of DataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the matrix product here.
Examples
Here we multiply a DataFrame with a Series.
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0   -4
1    5
dtype: int64
Here we multiply a DataFrame with another DataFrame.
>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
   0  1
0  1  4
1  2  2
Note that the dot method gives the same result as @

>>> df @ other
   0  1
0  1  4
1  2  2
The dot method also works if other is an np.array.

>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
   0  1
0  1  4
1  2  2
Note how shuffling of the objects does not change the result.
>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0   -4
1    5
dtype: int64
- classmethod from_dict(data, orient='columns', dtype=None, columns=None)[source]
Construct DataFrame from dict of array-like or dicts.
Creates DataFrame object from dictionary by columns or by index allowing dtype specification.
- Parameters:
data (dict) – Of the form {field : array-like} or {field : dict}.
orient ({'columns', 'index', 'tight'}, default 'columns') –
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].
New in version 1.4.0: ‘tight’ as an allowed value for the orient argument.
dtype (dtype, default None) – Data type to force after DataFrame construction, otherwise infer.
columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.
- Return type:
DataFrame
See also
DataFrame.from_records: DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.
DataFrame: DataFrame object creation using constructor.
DataFrame.to_dict: Convert the DataFrame to a dictionary.
Examples
By default the keys of the dict become the DataFrame columns:
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
Specify orient='index' to create the DataFrame using dictionary keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d
When using the ‘index’ orientation, the column names can be specified manually:
>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d
Specify orient='tight' to create the DataFrame using a ‘tight’ format:

>>> data = {'index': [('a', 'b'), ('a', 'c')],
...         'columns': [('x', 1), ('y', 2)],
...         'data': [[1, 3], [2, 4]],
...         'index_names': ['n1', 'n2'],
...         'column_names': ['z1', 'z2']}
>>> pd.DataFrame.from_dict(data, orient='tight')
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4
- to_numpy(dtype=None, copy=False, na_value=_NoDefault.no_default)[source]
Convert the DataFrame to a NumPy array.
By default, the dtype of the returned array will be the common NumPy dtype of all types in the DataFrame. For example, if the dtypes are float16 and float32, the resulting dtype will be float32. This may require copying data and coercing values, which may be expensive.
- Parameters:
dtype (str or numpy.dtype, optional) – The dtype to pass to numpy.asarray().
copy (bool, default False) – Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
na_value (Any, optional) –
The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
New in version 1.1.0.
- Return type:
numpy.ndarray
See also
Series.to_numpySimilar method for Series.
Examples
>>> pd.DataFrame({"A": [1, 2], "B": [3, 4]}).to_numpy() array([[1, 3], [2, 4]])
With heterogeneous data, the lowest common type will have to be used.
>>> df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.5]}) >>> df.to_numpy() array([[1. , 3. ], [2. , 4.5]])
For a mix of numeric and non-numeric types, the output array will have object dtype.
>>> df['C'] = pd.date_range('2000', periods=2) >>> df.to_numpy() array([[1, 3.0, Timestamp('2000-01-01 00:00:00')], [2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
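The na_value argument controls how missing values appear in the result; a small sketch using the nullable Int64 dtype:
>>> df = pd.DataFrame({"A": [1, None]}, dtype="Int64") >>> df.to_numpy(dtype="float64", na_value=np.nan) array([[ 1.], [nan]])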
- to_dict(orient: ~typing.Literal['dict', 'list', 'series', 'split', 'tight', 'index'] = 'dict', into: type[dict] = <class 'dict'>) dict[source]
- to_dict(orient: ~typing.Literal['records'], into: type[dict] = <class 'dict'>) list[dict]
Convert the DataFrame to a dictionary.
The type of the key-value pairs can be customized with the parameters (see below).
- Parameters:
orient (str {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}) –
Determines the type of the values of the dictionary.
'dict' (default) : dict like {column -> {index -> value}}
'list' : dict like {column -> [values]}
'series' : dict like {column -> Series(values)}
'split' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
'tight' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values], 'index_names' -> [index.names], 'column_names' -> [column.names]}
'records' : list like [{column -> value}, ... , {column -> value}]
'index' : dict like {index -> {column -> value}}
New in version 1.4.0: 'tight' as an allowed value for the orient argument.
into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
index (bool, default True) – Whether to include the index item (and index_names item if orient is 'tight') in the returned dictionary. Can only be False when orient is 'split' or 'tight'.
New in version 2.0.0.
- Returns:
Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.
- Return type:
dict, list or collections.abc.Mapping
See also
DataFrame.from_dictCreate a DataFrame from a dictionary.
DataFrame.to_jsonConvert a DataFrame to JSON format.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], ... 'col2': [0.5, 0.75]}, ... index=['row1', 'row2']) >>> df col1 col2 row1 1 0.50 row2 2 0.75 >>> df.to_dict() {'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can specify the return orientation.
>>> df.to_dict('series') {'col1': row1 1 row2 2 Name: col1, dtype: int64, 'col2': row1 0.50 row2 0.75 Name: col2, dtype: float64}
>>> df.to_dict('split') {'index': ['row1', 'row2'], 'columns': ['col1', 'col2'], 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records') [{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index') {'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
>>> df.to_dict('tight') {'index': ['row1', 'row2'], 'columns': ['col1', 'col2'], 'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}
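With pandas 2.0 or later, the index can be dropped from the 'split' and 'tight' results via index=False (a sketch assuming pandas >= 2.0):
>>> df.to_dict('split', index=False) {'columns': ['col1', 'col2'], 'data': [[1, 0.5], [2, 0.75]]}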
You can also specify the mapping type.
>>> from collections import OrderedDict, defaultdict >>> df.to_dict(into=OrderedDict) OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
If you want a defaultdict, you need to initialize it:
>>> dd = defaultdict(list) >>> df.to_dict('records', into=dd) [defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}), defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
- to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=True, table_schema=None, location=None, progress_bar=True, credentials=None)[source]
Write a DataFrame to a Google BigQuery table.
This function requires the pandas-gbq package.
See the How to authenticate with Google BigQuery guide for authentication instructions.
- Parameters:
destination_table (str) – Name of table to be written, in the form dataset.tablename.
project_id (str, optional) – Google BigQuery Account project ID. Optional when available from the environment.
chunksize (int, optional) – Number of rows to be inserted in each chunk from the dataframe. Set to None to load the whole dataframe at once.
reauth (bool, default False) – Force Google BigQuery to re-authenticate the user. This is useful if multiple accounts are used.
if_exists (str, default 'fail') –
Behavior when the destination table exists. Value can be one of:
'fail' : If table exists, raise pandas_gbq.gbq.TableCreationError.
'replace' : If table exists, drop it, recreate it, and insert data.
'append' : If table exists, insert data. Create if does not exist.
auth_local_webserver (bool, default True) –
Use the local webserver flow instead of the console flow when getting user credentials.
New in version 0.2.0 of pandas-gbq.
Changed in version 1.5.0: Default value is changed to True. Google has deprecated the auth_local_webserver = False "out of band" (copy-paste) flow.
table_schema (list of dicts, optional) – List of BigQuery table fields to which the DataFrame columns conform, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If a schema is not provided, it will be generated according to the dtypes of the DataFrame columns. See BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
location (str, optional) –
Location where the load job should run. See the BigQuery locations documentation for a list of available locations. The location must match that of the target dataset.
New in version 0.5.0 of pandas-gbq.
progress_bar (bool, default True) –
Use the library tqdm to show the progress bar for the upload, chunk by chunk.
New in version 0.5.0 of pandas-gbq.
credentials (google.auth.credentials.Credentials, optional) –
Credentials for accessing Google APIs. Use this parameter to override default credentials, such as to use Compute Engine google.auth.compute_engine.Credentials or Service Account google.oauth2.service_account.Credentials directly.
New in version 0.8.0 of pandas-gbq.
- Return type:
None
See also
pandas_gbq.to_gbqThis function in the pandas-gbq library.
read_gbqRead a DataFrame from Google BigQuery.
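Examples
This docstring carries no example; a minimal, hypothetical sketch (the project and table names are placeholders, and the pandas-gbq package must be installed):
>>> df = pd.DataFrame({'my_string': ['a', 'b', 'c']}) >>> df.to_gbq('my_dataset.my_table', project_id='my-project', ... if_exists='replace') # doctest: +SKIP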
- classmethod from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)[source]
Convert structured or record ndarray to DataFrame.
Creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame.
- Parameters:
data (structured ndarray, sequence of tuples or dicts, or DataFrame) – Structured input data.
index (str, list of fields, array-like) – Field of array to use as the index, alternately a specific set of input labels to use.
exclude (sequence, default None) – Columns or fields to exclude.
columns (sequence, default None) – Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns).
coerce_float (bool, default False) – Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
nrows (int, default None) – Number of rows to read if data is an iterator.
- Return type:
DataFrame
See also
DataFrame.from_dictDataFrame from dict of array-like or dicts.
DataFrameDataFrame object creation using constructor.
Examples
Data can be provided as a structured ndarray:
>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')], ... dtype=[('col_1', 'i4'), ('col_2', 'U1')]) >>> pd.DataFrame.from_records(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
Data can be provided as a list of dicts:
>>> data = [{'col_1': 3, 'col_2': 'a'}, ... {'col_1': 2, 'col_2': 'b'}, ... {'col_1': 1, 'col_2': 'c'}, ... {'col_1': 0, 'col_2': 'd'}] >>> pd.DataFrame.from_records(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
Data can be provided as a list of tuples with corresponding columns:
>>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')] >>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2']) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
- to_records(index=True, column_dtypes=None, index_dtypes=None)[source]
Convert DataFrame to a NumPy record array.
Index will be included as the first field of the record array if requested.
- Parameters:
index (bool, default True) – Include index in resulting record array, stored in ‘index’ field or using the index label, if set.
column_dtypes (str, type, dict, default None) – If a string or type, the data type to store all columns. If a dictionary, a mapping of column names and indices (zero-indexed) to specific data types.
index_dtypes (str, type, dict, default None) –
If a string or type, the data type to store all index levels. If a dictionary, a mapping of index level names and indices (zero-indexed) to specific data types.
This mapping is applied only if index=True.
- Returns:
NumPy ndarray with the DataFrame labels as fields and each row of the DataFrame as entries.
- Return type:
numpy.recarray
See also
DataFrame.from_recordsConvert structured or record ndarray to DataFrame.
numpy.recarrayAn ndarray that allows field access using attributes, analogous to typed columns in a spreadsheet.
Examples
>>> df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 0.75]}, ... index=['a', 'b']) >>> df A B a 1 0.50 b 2 0.75 >>> df.to_records() rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)], dtype=[('index', 'O'), ('A', '<i8'), ('B', '<f8')])
If the DataFrame index has no label then the recarray field name is set to ‘index’. If the index has a label then this is used as the field name:
>>> df.index = df.index.rename("I") >>> df.to_records() rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)], dtype=[('I', 'O'), ('A', '<i8'), ('B', '<f8')])
The index can be excluded from the record array:
>>> df.to_records(index=False) rec.array([(1, 0.5 ), (2, 0.75)], dtype=[('A', '<i8'), ('B', '<f8')])
Data types can be specified for the columns:
>>> df.to_records(column_dtypes={"A": "int32"}) rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)], dtype=[('I', 'O'), ('A', '<i4'), ('B', '<f8')])
As well as for the index:
>>> df.to_records(index_dtypes="<S2") rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)], dtype=[('I', 'S2'), ('A', '<i8'), ('B', '<f8')])
>>> index_dtypes = f"<S{df.index.str.len().max()}" >>> df.to_records(index_dtypes=index_dtypes) rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)], dtype=[('I', 'S1'), ('A', '<i8'), ('B', '<f8')])
- to_stata(path, *, convert_dates=None, write_index=True, byteorder=None, time_stamp=None, data_label=None, variable_labels=None, version=114, convert_strl=None, compression='infer', storage_options=None, value_labels=None)[source]
Export DataFrame object to Stata dta format.
Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.
- Parameters:
path (str, path object, or buffer) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function.
convert_dates (dict) – Dictionary mapping columns containing datetime types to the Stata internal format to use when writing the dates. Options are 'tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to 'tc'. Raises NotImplementedError if a datetime column has timezone information.
write_index (bool) – Write the index to Stata dataset.
byteorder (str) – Can be ">", "<", "little", or "big". Default is sys.byteorder.
time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.
data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.
variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.
version ({114, 117, 118, 119, None}, default 114) –
Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. Version 114 can be read by Stata 10 and later. Version 117 can be read by Stata 13 or later. Version 118 is supported in Stata 14 and later. Version 119 is supported in Stata 15 and later. Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and version 119 supports more than 32,767 variables.
Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read version 119 files.
convert_strl (list, optional) – List of column names to convert to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If 'infer' and 'path' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'}; other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
New in version 1.5.0: Added support for .tar files.
New in version 1.1.0.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
value_labels (dict of dicts) –
Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.
New in version 1.4.0.
- Raises:
NotImplementedError – If datetimes contain timezone information, or a column dtype is not representable in Stata.
ValueError – Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime, a column listed in convert_dates is not in the DataFrame, or a categorical label contains more than 32,000 characters.
- Return type:
None
See also
read_stataImport Stata data files.
io.stata.StataWriterLow-level writer for Stata data files.
io.stata.StataWriter117Low-level writer for version 117 files.
Examples
>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon', ... 'parrot'], ... 'speed': [350, 18, 361, 15]}) >>> df.to_stata('animals.dta')
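The keyword-only arguments can be combined; a sketch adding illustrative variable labels and a newer dta version:
>>> df.to_stata('animals.dta', version=118, ... variable_labels={'animal': 'Animal name', ... 'speed': 'Max speed'}) # doctest: +SKIP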
- to_feather(path, **kwargs)[source]
Write a DataFrame to the binary Feather format.
- Parameters:
path (str, path object, file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.
**kwargs – Additional keywords passed to pyarrow.feather.write_feather(). Starting with pyarrow 0.17, this includes the compression, compression_level, chunksize and version keywords.
New in version 1.1.0.
- Return type:
None
Notes
This function writes the dataframe as a feather file. Requires a default index. For saving a DataFrame with a custom index, use a method that supports custom indices, e.g. to_parquet.
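Examples
No example ships with this docstring; a minimal round-trip sketch (requires pyarrow; the file name is illustrative):
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df.to_feather('df.feather') # doctest: +SKIP >>> pd.read_feather('df.feather') # doctest: +SKIP col1 col2 0 1 3 1 2 4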
- to_markdown(buf=None, mode='wt', index=True, storage_options=None, **kwargs)[source]
Print DataFrame in Markdown-friendly format.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) –
Add index (row) labels.
New in version 1.1.0.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
**kwargs – These parameters will be passed to tabulate.
- Returns:
DataFrame in Markdown-friendly format.
- Return type:
str
Notes
Requires the tabulate package.
Examples
>>> df = pd.DataFrame( ... data={"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]} ... ) >>> print(df.to_markdown()) | | animal_1 | animal_2 | |---:|:-----------|:-----------| | 0 | elk | dog | | 1 | pig | quetzal |
Output markdown with a tabulate option.
>>> print(df.to_markdown(tablefmt="grid")) +----+------------+------------+ | | animal_1 | animal_2 | +====+============+============+ | 0 | elk | dog | +----+------------+------------+ | 1 | pig | quetzal | +----+------------+------------+
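Passing a path-like buf writes the table to a file instead of returning a string; a sketch (the file name is illustrative):
>>> df.to_markdown('animals.md') # doctest: +SKIP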
- to_parquet(path: None = None, engine: str = 'auto', compression: str | None = 'snappy', index: bool | None = None, partition_cols: list[str] | None = None, storage_options: Dict[str, Any] | None = None, **kwargs) bytes[source]
- to_parquet(path: FilePath | WriteBuffer[bytes], engine: str = 'auto', compression: str | None = 'snappy', index: bool | None = None, partition_cols: list[str] | None = None, storage_options: Dict[str, Any] | None = None, **kwargs) None
Write a DataFrame to the binary parquet format.
This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.
- Parameters:
path (str, path object, file-like object, or None, default None) –
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If None, the result is returned as bytes. If a string or path, it will be used as Root Directory path when writing a partitioned dataset.
Changed in version 1.2.0: Previously this was "fname".
engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.
compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use. Use None for no compression.
index (bool, default None) – If True, include the dataframe's index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe's index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn't require much space and is faster. Other indexes will be included as columns in the file output.
partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
**kwargs – Additional arguments passed to the parquet library. See pandas io for more details.
- Return type:
bytes if no path argument is provided else None
See also
read_parquetRead a parquet file.
DataFrame.to_orcWrite an orc file.
DataFrame.to_csvWrite a csv file.
DataFrame.to_sqlWrite to a sql table.
DataFrame.to_hdfWrite to hdf.
Notes
This function requires either the fastparquet or pyarrow library.
Examples
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}) >>> df.to_parquet('df.parquet.gzip', ... compression='gzip') >>> pd.read_parquet('df.parquet.gzip') col1 col2 0 1 3 1 2 4
If you want to get a buffer to the parquet content you can use a io.BytesIO object, as long as you don’t use partition_cols, which creates multiple files.
>>> import io >>> f = io.BytesIO() >>> df.to_parquet(f) >>> f.seek(0) 0 >>> content = f.read()
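With partition_cols, the output is a partitioned dataset with one directory per distinct value; a sketch (the directory name is illustrative, and path must then be a string):
>>> df.to_parquet('df_partitioned', partition_cols=['col1']) # doctest: +SKIP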
- to_orc(path=None, *, engine='pyarrow', index=None, engine_kwargs=None)[source]
Write a DataFrame to the ORC format.
New in version 1.5.0.
- Parameters:
path (str, file-like object or None, default None) – If a string, it will be used as Root Directory path when writing a partitioned dataset. By file-like object, we refer to objects with a write() method, such as a file handle (e.g. via builtin open function). If path is None, a bytes object is returned.
engine (str, default 'pyarrow') – ORC library to use. Pyarrow must be >= 7.0.0.
index (bool, optional) – If True, include the dataframe's index(es) in the file output. If False, they will not be written to the file. If None, similar to infer the dataframe's index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn't require much space and is faster. Other indexes will be included as columns in the file output.
engine_kwargs (dict[str, Any] or None, default None) – Additional keyword arguments passed to pyarrow.orc.write_table().
- Return type:
bytes if no path argument is provided else None
- Raises:
NotImplementedError – Dtype of one or more columns is category, unsigned integers, interval, period or sparse.
ValueError – engine is not pyarrow.
See also
read_orcRead an ORC file.
DataFrame.to_parquetWrite a parquet file.
DataFrame.to_csvWrite a csv file.
DataFrame.to_sqlWrite to a sql table.
DataFrame.to_hdfWrite to hdf.
Notes
Before using this function you should read the user guide about ORC and install optional dependencies.
This function requires the pyarrow library.
For supported dtypes please refer to supported ORC features in Arrow.
Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
Examples
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]}) >>> df.to_orc('df.orc') >>> pd.read_orc('df.orc') col1 col2 0 1 4 1 2 3
If you want to get a buffer to the ORC content you can write it to io.BytesIO:
>>> import io >>> b = io.BytesIO(df.to_orc()) # doctest: +SKIP >>> b.seek(0) # doctest: +SKIP 0 >>> content = b.read() # doctest: +SKIP
- to_html(buf: FilePath | WriteBuffer[str], columns: Sequence[Hashable] | None = None, col_space: str | int | Sequence[str | int] | Mapping[Hashable, str | int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool | str = False, decimal: str = '.', bold_rows: bool = True, classes: str | list | tuple | None = None, escape: bool = True, notebook: bool = False, border: int | bool | None = None, table_id: str | None = None, render_links: bool = False, encoding: str | None = None) None[source]
- to_html(buf: None = None, columns: Sequence[Hashable] | None = None, col_space: str | int | Sequence[str | int] | Mapping[Hashable, str | int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool | str = False, decimal: str = '.', bold_rows: bool = True, classes: str | list | tuple | None = None, escape: bool = True, notebook: bool = False, border: int | bool | None = None, table_id: str | None = None, render_links: bool = False, encoding: str | None = None) str
Render a DataFrame as an HTML table.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
col_space (str or int, list or dict of int or str, optional) – The minimum width of each column in CSS length units. An int is assumed to be px units.
header (bool, optional) – Whether to print column labels, default True.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of NaN to use.
formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns' elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
float_format (one-parameter function, optional, default None) –
Formatter function to apply to columns' elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.
Changed in version 1.2.0.
sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) –
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
bold_rows (bool, default True) – Make the row labels bold in the output.
classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.
escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.
notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.
border (int) – A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.
table_id (str, optional) – A css id is included in the opening <table> tag if specified.
render_links (bool, default False) – Convert URLs to HTML links.
encoding (str, default "utf-8") –
Set character encoding.
New in version 1.0.
- Returns:
If buf is None, returns the result as a string. Otherwise returns None.
- Return type:
str or None
See also
to_stringConvert DataFrame to a string.
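Examples
No example ships with this docstring; a minimal sketch (the exact markup depends on display options, so only the prefix is checked):
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> html = df.to_html(table_id='my-table') >>> html.startswith('<table') True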
- to_xml(path_or_buffer=None, index=True, root_name='data', row_name='row', na_rep=None, attr_cols=None, elem_cols=None, namespaces=None, prefix=None, encoding='utf-8', xml_declaration=True, pretty_print=True, parser='lxml', stylesheet=None, compression='infer', storage_options=None)[source]
Render a DataFrame to an XML document.
New in version 1.3.0.
- Parameters:
path_or_buffer (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.
index (bool, default True) – Whether to include index in XML document.
root_name (str, default 'data') – The name of root element in XML document.
row_name (str, default 'row') – The name of row element in XML document.
na_rep (str, optional) – Missing data representation.
attr_cols (list-like, optional) – List of columns to write as attributes in row element. Hierarchical columns will be flattened with underscore delimiting the different levels.
elem_cols (list-like, optional) – List of columns to write as children in row element. By default, all columns output as children of row element. Hierarchical columns will be flattened with underscore delimiting the different levels.
namespaces (dict, optional) –
All namespaces to be defined in root element. Keys of dict should be prefix names and values of dict corresponding URIs. Default namespaces should be given empty string key. For example,
namespaces = {"": "https://example.com"}
prefix (str, optional) – Namespace prefix to be used for every element and/or attribute in document. This should be one of the keys in namespaces dict.
encoding (str, default 'utf-8') – Encoding of the resulting document.
xml_declaration (bool, default True) – Whether to include the XML declaration at start of document.
pretty_print (bool, default True) – Whether output should be pretty printed with indentation and line breaks.
parser ({'lxml','etree'}, default 'lxml') – Parser module to use for building of tree. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’, the ability to use XSLT stylesheet is supported.
stylesheet (str, path object or file-like object, optional) – A URL, file-like object, or a raw string containing an XSLT script used to transform the raw XML output. Script should use layout of elements and attributes from original output. This argument requires lxml to be installed. Only XSLT 1.0 scripts, and not later versions, are currently supported.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If 'infer' and 'path_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'}; other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
- Returns:
If path_or_buffer is None, returns the resulting XML format as a string. Otherwise returns None.
- Return type:
None or str
See also
to_jsonConvert the pandas object to a JSON string.
to_htmlConvert DataFrame to HTML.
Examples
>>> df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'], ... 'degrees': [360, 360, 180], ... 'sides': [4, np.nan, 3]})
>>> df.to_xml() <?xml version='1.0' encoding='utf-8'?> <data> <row> <index>0</index> <shape>square</shape> <degrees>360</degrees> <sides>4.0</sides> </row> <row> <index>1</index> <shape>circle</shape> <degrees>360</degrees> <sides/> </row> <row> <index>2</index> <shape>triangle</shape> <degrees>180</degrees> <sides>3.0</sides> </row> </data>
>>> df.to_xml(attr_cols=[ ... 'index', 'shape', 'degrees', 'sides' ... ]) <?xml version='1.0' encoding='utf-8'?> <data> <row index="0" shape="square" degrees="360" sides="4.0"/> <row index="1" shape="circle" degrees="360"/> <row index="2" shape="triangle" degrees="180" sides="3.0"/> </data>
>>> df.to_xml(namespaces={"doc": "https://example.com"}, ... prefix="doc") <?xml version='1.0' encoding='utf-8'?> <doc:data xmlns:doc="https://example.com"> <doc:row> <doc:index>0</doc:index> <doc:shape>square</doc:shape> <doc:degrees>360</doc:degrees> <doc:sides>4.0</doc:sides> </doc:row> <doc:row> <doc:index>1</doc:index> <doc:shape>circle</doc:shape> <doc:degrees>360</doc:degrees> <doc:sides/> </doc:row> <doc:row> <doc:index>2</doc:index> <doc:shape>triangle</doc:shape> <doc:degrees>180</doc:degrees> <doc:sides>3.0</doc:sides> </doc:row> </doc:data>
- info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)[source]
Print a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
- Parameters:
verbose (bool, optional) – Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.
buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
max_cols (int, optional) – When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.
memory_usage (bool, str, optional) – Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting. True always shows memory usage. False never shows memory usage. A value of 'deep' is equivalent to "True with deep introspection". Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based on column dtype and number of rows, assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources. See the Frequently Asked Questions for more details.
show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
- Returns:
This method prints a summary of a DataFrame and returns None.
- Return type:
None
See also
DataFrame.describeGenerate descriptive statistics of DataFrame columns.
DataFrame.memory_usageMemory usage of DataFrame columns.
Examples
>>> int_values = [1, 2, 3, 4, 5] >>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon'] >>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0] >>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values, ... "float_col": float_values}) >>> df int_col text_col float_col 0 1 alpha 0.00 1 2 beta 0.25 2 3 gamma 0.50 3 4 delta 0.75 4 5 epsilon 1.00
Prints information of all columns:
>>> df.info(verbose=True) <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 int_col 5 non-null int64 1 text_col 5 non-null object 2 float_col 5 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 248.0+ bytes
Prints a summary of the column count and dtypes, but not per-column information:
>>> df.info(verbose=False) <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Columns: 3 entries, int_col to float_col dtypes: float64(1), int64(1), object(1) memory usage: 248.0+ bytes
Pipe the output of DataFrame.info to a buffer instead of sys.stdout, get the buffer content and write it to a text file:
>>> import io >>> buffer = io.StringIO() >>> df.info(buf=buffer) >>> s = buffer.getvalue() >>> with open("df_info.txt", "w", ... encoding="utf-8") as f: ... f.write(s) 260
The memory_usage parameter allows deep introspection mode, especially useful for big DataFrames and fine-tuning memory optimization:
>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6) >>> df = pd.DataFrame({ ... 'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6), ... 'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6), ... 'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6) ... }) >>> df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000000 entries, 0 to 999999 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 column_1 1000000 non-null object 1 column_2 1000000 non-null object 2 column_3 1000000 non-null object dtypes: object(3) memory usage: 22.9+ MB
>>> df.info(memory_usage='deep') <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000000 entries, 0 to 999999 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 column_1 1000000 non-null object 1 column_2 1000000 non-null object 2 column_3 1000000 non-null object dtypes: object(3) memory usage: 165.9 MB
- memory_usage(index=True, deep=False)[source]
Return the memory usage of each column in bytes.
The memory usage can optionally include the contribution of the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.
- Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame's index in the returned Series. If index=True, the memory usage of the index is the first item in the output.
deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
- Returns:
A Series whose index is the original column names and whose values are the memory usage of each column in bytes.
- Return type:
Series
See also
numpy.ndarray.nbytesTotal bytes consumed by the elements of an ndarray.
Series.memory_usageBytes consumed by a Series.
CategoricalMemory-efficient array for string values with many repeated values.
DataFrame.infoConcise summary of a DataFrame.
Notes
See the Frequently Asked Questions for more details.
Examples
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool'] >>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t)) ... for t in dtypes]) >>> df = pd.DataFrame(data) >>> df.head() int64 float64 complex128 object bool 0 1 1.0 1.0+0.0j 1 True 1 1 1.0 1.0+0.0j 1 True 2 1 1.0 1.0+0.0j 1 True 3 1 1.0 1.0+0.0j 1 True 4 1 1.0 1.0+0.0j 1 True
>>> df.memory_usage() Index 128 int64 40000 float64 40000 complex128 80000 object 40000 bool 5000 dtype: int64
>>> df.memory_usage(index=False) int64 40000 float64 40000 complex128 80000 object 40000 bool 5000 dtype: int64
The memory footprint of object dtype columns is ignored by default:
>>> df.memory_usage(deep=True) Index 128 int64 40000 float64 40000 complex128 80000 object 180000 bool 5000 dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True) 5244
- transpose(*args, copy=False)[source]
Transpose index and columns.
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T is an accessor to the method transpose().
- Parameters:
*args (tuple, optional) – Accepted for compatibility with NumPy.
copy (bool, default False) – Whether to copy the data after transposing, even for DataFrames with a single dtype.
- Returns:
The transposed DataFrame.
- Return type:
DataFrame
See also
numpy.transposePermute the dimensions of a given array.
Notes
Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.
Examples
Square DataFrame with homogeneous dtype
>>> d1 = {'col1': [1, 2], 'col2': [3, 4]} >>> df1 = pd.DataFrame(data=d1) >>> df1 col1 col2 0 1 3 1 2 4
>>> df1_transposed = df1.T # or df1.transpose() >>> df1_transposed 0 1 col1 1 2 col2 3 4
When the dtype is homogeneous in the original DataFrame, we get a transposed DataFrame with the same dtype:
>>> df1.dtypes col1 int64 col2 int64 dtype: object >>> df1_transposed.dtypes 0 int64 1 int64 dtype: object
Non-square DataFrame with mixed dtypes
>>> d2 = {'name': ['Alice', 'Bob'], ... 'score': [9.5, 8], ... 'employed': [False, True], ... 'kids': [0, 0]} >>> df2 = pd.DataFrame(data=d2) >>> df2 name score employed kids 0 Alice 9.5 False 0 1 Bob 8.0 True 0
>>> df2_transposed = df2.T # or df2.transpose() >>> df2_transposed 0 1 name Alice Bob score 9.5 8.0 employed False True kids 0 0
When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype:
>>> df2.dtypes name object score float64 employed bool kids int64 dtype: object >>> df2_transposed.dtypes 0 object 1 object dtype: object
- property T: DataFrame
The transpose of the DataFrame.
- Returns:
The transposed DataFrame.
- Return type:
DataFrame
See also
DataFrame.transposeTranspose index and columns.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df col1 col2 0 1 3 1 2 4
>>> df.T 0 1 col1 1 2 col2 3 4
- isetitem(loc, value)[source]
Set the given value in the column with position loc.
This is a positional analogue to __setitem__.
- Parameters:
loc (int or sequence of ints) – Index position for the column.
value (scalar or arraylike) – Value(s) for the column.
- Return type:
None
Notes
frame.isetitem(loc, value) is an in-place method as it will modify the DataFrame in place (not returning a new object). In contrast to frame.iloc[:, i] = value, which will try to update the existing values in place, frame.isetitem(loc, value) will not update the values of the column itself in place; it will instead insert a new array.
In cases where frame.columns is unique, this is equivalent to frame[frame.columns[i]] = value.
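Examples
No example ships with this docstring; a minimal sketch, assuming pandas >= 1.5 (where isetitem was added):
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) >>> df.isetitem(1, [30, 40]) >>> df A B 0 1 30 1 2 40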
- query(expr: str, *, inplace: Literal[False] = False, **kwargs) DataFrame[source]
- query(expr: str, *, inplace: Literal[True], **kwargs) None
- query(expr: str, *, inplace: bool = False, **kwargs) DataFrame | None
Query the columns of a DataFrame with a boolean expression.
- Parameters:
expr (str) –
The query string to evaluate.
You can refer to variables in the environment by prefixing them with an '@' character like @a + b.
You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuation (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named "Area (cm^2)" would be referenced as `Area (cm^2)`). Column names which are Python keywords (like "list", "for", "import", etc.) cannot be used.
For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.
inplace (bool) – Whether to modify the DataFrame rather than creating a new one.
**kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by DataFrame.query().
- Returns:
DataFrame resulting from the provided query expression or None if inplace=True.
- Return type:
DataFrame or None
See also
evalEvaluate a string describing operations on DataFrame columns.
DataFrame.evalEvaluate a string describing operations on DataFrame columns.
Notes
The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().
This method uses the top-level eval() function to evaluate the passed query.
The query() method uses a slightly modified Python syntax by default. For example, the & and | (bitwise) operators have the precedence of their boolean cousins, and and or. This is syntactically valid Python, however the semantics are different.
You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
The DataFrame.index and DataFrame.columns attributes of the DataFrame instance are placed in the query namespace by default, which allows you to treat both the index and columns of the frame as a column in the frame. The identifier index is used for the frame index; you can also use the name of the index to identify it in a query. Please note that Python keywords may not be used as identifiers.
For further details and examples see the query documentation in indexing.
Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and are converted internally to a Python valid identifier. This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick quoted string are replaced by strings that are allowed as a Python identifier. These characters include all operators in Python, the space character, the question mark, the exclamation mark, the dollar sign, and the euro sign. For other characters that fall outside the ASCII range (U+0001..U+007F) and those that are not further specified in PEP 3131, the query parser will raise an error. This excludes whitespace other than the space character, but also the hashtag (as it is used for comments) and the backtick itself (the backtick also cannot be escaped).
In a special case, quotes that make a pair around a backtick can confuse the parser. For example, `it's` > `that's` will raise an error, as it forms a quoted string ('s > `that') with a backtick inside.
See also the Python documentation about lexical analysis (https://docs.python.org/3/reference/lexical_analysis.html) in combination with the source code in pandas.core.computation.parsing.
Examples
>>> df = pd.DataFrame({'A': range(1, 6), ... 'B': range(10, 0, -2), ... 'C C': range(10, 5, -1)}) >>> df A B C C 0 1 10 10 1 2 8 9 2 3 6 8 3 4 4 7 4 5 2 6 >>> df.query('A > B') A B C C 4 5 2 6
The previous expression is equivalent to
>>> df[df.A > df.B] A B C C 4 5 2 6
For columns with spaces in their name, you can use backtick quoting.
>>> df.query('B == `C C`') A B C C 0 1 10 10
The previous expression is equivalent to
>>> df[df.B == df['C C']] A B C C 0 1 10 10
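Environment variables can be referenced with the '@' prefix described above; continuing the example (the variable name is illustrative):
>>> threshold = 4 >>> df.query('A > @threshold') A B C C 4 5 2 6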
- eval(expr: str, *, inplace: Literal[False] = False, **kwargs) Any[source]
- eval(expr: str, *, inplace: Literal[True], **kwargs) None
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
- Parameters:
expr (str) – The expression string to evaluate.
inplace (bool, default False) – If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
**kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by query().
- Returns:
The result of the evaluation or None if inplace=True.
- Return type:
ndarray, scalar, pandas object, or None
See also
DataFrame.queryEvaluates a boolean expression to query the columns of a frame.
DataFrame.assignCan evaluate an expression or function to create new values for a column.
evalEvaluate a Python expression as a string using various backends.
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.
Examples
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)}) >>> df A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2 >>> df.eval('A + B') 0 11 1 10 2 9 3 8 4 7 dtype: int64
Assignment is allowed, though by default the original DataFrame is not modified.
>>> df.eval('C = A + B') A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7 >>> df A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2
Multiple columns can be assigned to using multi-line expressions:
>>> df.eval( ... ''' ... C = A + B ... D = A - B ... ''' ... ) A B C D 0 1 10 11 -9 1 2 8 10 -6 2 3 6 9 -3 3 4 4 8 0 4 5 2 7 3
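Passing inplace=True mutates the original frame instead of returning a new one; a short sketch continuing the example:
>>> df.eval('C = A + B', inplace=True) >>> df A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7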
- select_dtypes(include=None, exclude=None)[source]
Return a subset of the DataFrame’s columns based on the column dtypes.
- Parameters:
include (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
exclude (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
- Returns:
The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
- Return type:
DataFrame
- Raises:
ValueError – If both of include and exclude are empty, if include and exclude have overlapping elements, or if any kind of string dtype is passed in.
See also
DataFrame.dtypesReturn Series with the data type of each column.
Notes
To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that this will return all object dtype columns
See the numpy dtype hierarchy
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
To select Pandas categorical dtypes, use 'category'
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
Examples
>>> df = pd.DataFrame({'a': [1, 2] * 3, ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df a b c 0 1 True 1.0 1 2 False 2.0 2 1 True 1.0 3 2 False 2.0 4 1 True 1.0 5 2 False 2.0
>>> df.select_dtypes(include='bool') b 0 True 1 False 2 True 3 False 4 True 5 False
>>> df.select_dtypes(include=['float64']) c 0 1.0 1 2.0 2 1.0 3 2.0 4 1.0 5 2.0
>>> df.select_dtypes(exclude=['int64']) b c 0 True 1.0 1 False 2.0 2 True 1.0 3 False 2.0 4 True 1.0 5 False 2.0
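As the notes above suggest, 'number' selects all numeric dtypes at once (here the int64 and float64 columns, but not the bool column):
>>> df.select_dtypes(include='number') a c 0 1 1.0 1 2 2.0 2 1 1.0 3 2 2.0 4 1 1.0 5 2 2.0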
- insert(loc, column, value, allow_duplicates=_NoDefault.no_default)[source]
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- Parameters:
loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).
column (str, number, or hashable object) – Label of the inserted column.
value (Scalar, Series, or array-like) – Content of the inserted column.
allow_duplicates (bool, optional) – Allow duplicate column labels to be created.
- Return type:
None
See also
Index.insertInsert new item by index.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df col1 col2 0 1 3 1 2 4 >>> df.insert(1, "newcol", [99, 99]) >>> df col1 newcol col2 0 1 99 3 1 2 99 4 >>> df.insert(0, "col1", [100, 100], allow_duplicates=True) >>> df col1 col1 newcol col2 0 100 1 99 3 1 100 2 99 4
Notice that pandas uses index alignment when value is of type Series:
>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2])) >>> df col0 col1 col1 newcol col2 0 NaN 100 1 99 3 1 5.0 100 2 99 4
- assign(**kwargs)[source]
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
- Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas doesn't check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
- Returns:
A new DataFrame with the new columns in addition to all the existing columns.
- Return type:
Notes
Assigning multiple columns within the same assign is possible. Later items in '**kwargs' may refer to newly created or modified columns in 'df'; items are computed and assigned into 'df' in order.
Examples
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]}, ... index=['Portland', 'Berkeley']) >>> df temp_c Portland 17.0 Berkeley 25.0
Where the value is a callable, evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0
Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:
>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0
You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32, ... temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9) temp_c temp_f temp_k Portland 17.0 62.6 290.15 Berkeley 25.0 77.0 298.15
- align(other, join='outer', axis=None, level=None, copy=None, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)[source]
Align two objects on their axes with the specified join method.
Join method is specified for each axis Index.
- Parameters:
join ({'outer', 'inner', 'left', 'right'}, default 'outer') – Type of alignment to be performed.
axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –
Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
fill_axis ({0 or 'index', 1 or 'columns'}, default 0) – Filling axis, method and limit.
broadcast_axis ({0 or 'index', 1 or 'columns'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
- Returns:
Aligned objects.
- Return type:
tuple of (DataFrame, type of other)
Examples
>>> df = pd.DataFrame( ... [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2] ... ) >>> other = pd.DataFrame( ... [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]], ... columns=["A", "B", "C", "D"], ... index=[2, 3, 4], ... ) >>> df D B E A 1 1 2 3 4 2 6 7 8 9 >>> other A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Align on columns:
>>> left, right = df.align(other, join="outer", axis=1) >>> left A B C D E 1 4 2 NaN 1 3 2 9 7 NaN 6 8 >>> right A B C D E 2 10 20 30 40 NaN 3 60 70 80 90 NaN 4 600 700 800 900 NaN
We can also align on the index:
>>> left, right = df.align(other, join="outer", axis=0) >>> left D B E A 1 1.0 2.0 3.0 4.0 2 6.0 7.0 8.0 9.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN >>> right A B C D 1 NaN NaN NaN NaN 2 10.0 20.0 30.0 40.0 3 60.0 70.0 80.0 90.0 4 600.0 700.0 800.0 900.0
Finally, the default axis=None will align on both index and columns:
>>> left, right = df.align(other, join="outer", axis=None) >>> left A B C D E 1 4.0 2.0 NaN 1.0 3.0 2 9.0 7.0 NaN 6.0 8.0 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN >>> right A B C D E 1 NaN NaN NaN NaN NaN 2 10.0 20.0 30.0 40.0 NaN 3 60.0 70.0 80.0 90.0 NaN 4 600.0 700.0 800.0 900.0 NaN
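Passing fill_value fills the positions that alignment would otherwise leave as NaN; a sketch continuing the column alignment above:
>>> left, right = df.align(other, join="outer", axis=1, fill_value=0) >>> left A B C D E 1 4 2 0 1 3 2 9 7 0 6 8 >>> right A B C D E 2 10 20 30 40 0 3 60 70 80 90 0 4 600 700 800 900 0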
- set_axis(labels, *, axis=0, copy=None)[source]
Assign desired index to given axis.
Indexes for column or row labels can be changed by assigning a list-like or Index.
- Parameters:
labels (list-like, Index) – The values for the new index.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to update. The value 0 identifies the rows. For Series this parameter is unused and defaults to 0.
copy (bool, default True) –
Whether to make a copy of the underlying data.
New in version 1.5.0.
- Returns:
An object of type DataFrame.
- Return type:
DataFrame
See also
DataFrame.rename_axisAlter the name of the index or columns.
Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
Change the row labels.
>>> df.set_axis(['a', 'b', 'c'], axis='index') A B a 1 4 b 2 5 c 3 6
Change the column labels.
>>> df.set_axis(['I', 'II'], axis='columns') I II 0 1 4 1 2 5 2 3 6
- reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)[source]
Conform DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and
copy=False.
- Parameters:
labels (array-like, optional) – New labels / index to conform the axis specified by ‘axis’ to.
index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.
columns (array-like, optional) – New labels for the columns. Preferably an Index object to avoid duplicating data.
axis (int or str, optional) – Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy (bool, default True) – Return a new object, even if the passed indexes are the same.
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.
tolerance (optional) –
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
- Returns:
DataFrame with changed index.
- Return type:
DataFrame
See also
DataFrame.set_indexSet row labels.
DataFrame.reset_indexRemove row labels or move them to new columns.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
DataFrame.reindex supports two calling conventions:
(index=index_labels, columns=column_labels, ...)
(labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror'] >>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301], ... 'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, ... index=index) >>> df http_status response_time Firefox 200 0.04 Chrome 200 0.02 Safari 404 0.07 IE10 404 0.08 Konqueror 301 1.00
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned
NaN.
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', ... 'Chrome'] >>> df.reindex(new_index) http_status response_time Safari 404.0 0.07 Iceweasel NaN NaN Comodo Dragon NaN NaN IE10 404.0 0.08 Chrome 200.0 0.02
We can fill in the missing values by passing a value to the keyword
fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.
>>> df.reindex(new_index, fill_value=0) http_status response_time Safari 404 0.07 Iceweasel 0 0.00 Comodo Dragon 0 0.00 IE10 404 0.08 Chrome 200 0.02
>>> df.reindex(new_index, fill_value='missing') http_status response_time Safari 404 0.07 Iceweasel missing missing Comodo Dragon missing missing IE10 404 0.08 Chrome 200 0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent']) http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
Or we can use “axis-style” keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns") http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
To further illustrate the filling functionality in
reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).
>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D') >>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, ... index=date_index) >>> df2 prices 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0
Suppose we decide to expand the dataframe to cover a wider date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D') >>> df2.reindex(date_index2) prices 2009-12-29 NaN 2009-12-30 NaN 2009-12-31 NaN 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with
NaN. If desired, we can fill in the missing values using one of several options.
For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.
>>> df2.reindex(date_index2, method='bfill') prices 2009-12-29 100.0 2009-12-30 100.0 2009-12-31 100.0 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
Please note that the
NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.
See the user guide for more.
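The tolerance parameter is not exercised above; as a minimal sketch (not from the upstream docstring examples; ts_index is an illustrative target index, reusing df2), method='nearest' only matches labels within the given distance of an original label:
>>> # 12-hour targets against a daily index; only exact or near matches fill
>>> ts_index = pd.date_range('2010-01-02', periods=3, freq='12H')
>>> df2.reindex(ts_index, method='nearest', tolerance=pd.Timedelta('1H'))
                     prices
2010-01-02 00:00:00   101.0
2010-01-02 12:00:00     NaN
2010-01-03 00:00:00     NaN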
- drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable = None, inplace: Literal[True], errors: Literal['ignore', 'raise'] = 'raise') None[source]
- drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable = None, inplace: Literal[False] = False, errors: Literal['ignore', 'raise'] = 'raise') DataFrame
- drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable = None, inplace: bool = False, errors: Literal['ignore', 'raise'] = 'raise') DataFrame | None
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.
- Parameters:
labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
- Returns:
DataFrame without the removed index or column labels or None if
inplace=True.
- Return type:
DataFrame or None
- Raises:
KeyError – If any of the labels is not found in the selected axis.
See also
DataFrame.locLabel-location based indexer for selection by label.
DataFrame.dropnaReturn DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicatesReturn DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.dropReturn Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4), ... columns=['A', 'B', 'C', 'D']) >>> df A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11
Drop columns
>>> df.drop(['B', 'C'], axis=1) A D 0 0 3 1 4 7 2 8 11
>>> df.drop(columns=['B', 'C']) A D 0 0 3 1 4 7 2 8 11
Drop a row by index
>>> df.drop([0, 1]) A B C D 2 8 9 10 11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'], ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> df = pd.DataFrame(index=midx, columns=['big', 'small'], ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20], ... [250, 150], [1.5, 0.8], [320, 250], ... [1, 0.8], [0.3, 0.2]]) >>> df big small lama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 weight 1.0 0.8 length 0.3 0.2
Drop a specific index combination from the MultiIndex DataFrame, i.e., drop the combination
'falcon' and 'weight', which deletes only the corresponding row:
>>> df.drop(index=('falcon', 'weight')) big small lama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 length 0.3 0.2
>>> df.drop(index='cow', columns='small') big lama speed 45.0 weight 200.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3
>>> df.drop(index='length', level=1) big small lama speed 45.0 30.0 weight 200.0 100.0 cow speed 30.0 20.0 weight 250.0 150.0 falcon speed 320.0 250.0 weight 1.0 0.8
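The errors parameter is not exercised above; a minimal sketch (not from the upstream docstring examples; 'gorilla' is a deliberately absent label) showing that errors='ignore' suppresses the KeyError and drops nothing:
>>> # 'gorilla' is not in the index; with errors='ignore' the frame is returned unchanged
>>> df.drop(index='gorilla', errors='ignore').equals(df)
True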
- rename(mapper: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, *, index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, columns: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool | None = None, inplace: Literal[True], level: Hashable = None, errors: Literal['ignore', 'raise'] = 'ignore') None[source]
- rename(mapper: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, *, index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, columns: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool | None = None, inplace: Literal[False] = False, level: Hashable = None, errors: Literal['ignore', 'raise'] = 'ignore') DataFrame
- rename(mapper: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, *, index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, columns: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool | None = None, inplace: bool = False, level: Hashable = None, errors: Literal['ignore', 'raise'] = 'ignore') DataFrame | None
Rename columns or index labels.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
See the user guide for more.
- Parameters:
mapper (dict-like or function) – Dict-like or function transformations to apply to that axis’ values. Use either
mapper and axis to specify the axis to target with mapper, or index and columns.
index (dict-like or function) – Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).
columns (dict-like or function) – Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).
axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to target with mapper. Can be either the axis name ('index', 'columns') or number (0, 1). The default is 'index'.
copy (bool, default True) – Also copy underlying data.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.
level (int or level name, default None) – In case of a MultiIndex, only rename labels in the specified level.
errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
- Returns:
DataFrame with the renamed axis labels or None if
inplace=True.
- Return type:
DataFrame or None
- Raises:
KeyError – If any of the labels is not found in the selected axis and “errors=’raise’”.
See also
DataFrame.rename_axisSet the name of the axis.
Examples
DataFrame.rename supports two calling conventions:
(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Rename columns using a mapping:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename(columns={"A": "a", "B": "c"}) a c 0 1 4 1 2 5 2 3 6
Rename index using a mapping:
>>> df.rename(index={0: "x", 1: "y", 2: "z"}) A B x 1 4 y 2 5 z 3 6
Cast index labels to a different type:
>>> df.index RangeIndex(start=0, stop=3, step=1) >>> df.rename(index=str).index Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise") Traceback (most recent call last): KeyError: ['C'] not found in axis
Using axis-style parameters:
>>> df.rename(str.lower, axis='columns') a b 0 1 4 1 2 5 2 3 6
>>> df.rename({1: 2, 2: 4}, axis='index') A B 0 1 4 2 2 5 4 3 6
- fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[False] = False, limit: int | None = None, downcast: dict | None = None) DataFrame[source]
- fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[True], limit: int | None = None, downcast: dict | None = None) None
- fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: bool = False, limit: int | None = None, downcast: dict | None = None) DataFrame | None
Fill NA/NaN values using the specified method.
- Parameters:
value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –
Method to use for filling holes in reindexed Series:
ffill: propagate last valid observation forward to next valid.
backfill / bfill: use next valid observation to fill gap.
axis ({0 or 'index', 1 or 'columns'}) – Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.
inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
- Returns:
Object with missing values filled or None if
inplace=True.
- Return type:
DataFrame or None
See also
interpolateFill NaN values using interpolation.
reindexConform object to new index.
asfreqConvert TimeSeries to specified frequency.
Examples
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, np.nan], ... [np.nan, 3, np.nan, 4]], ... columns=list("ABCD")) >>> df A B C D 0 NaN 2.0 NaN 0.0 1 3.0 4.0 NaN 1.0 2 NaN NaN NaN NaN 3 NaN 3.0 NaN 4.0
Replace all NaN elements with 0s.
>>> df.fillna(0) A B C D 0 0.0 2.0 0.0 0.0 1 3.0 4.0 0.0 1.0 2 0.0 0.0 0.0 0.0 3 0.0 3.0 0.0 4.0
We can also propagate non-null values forward or backward.
>>> df.fillna(method="ffill") A B C D 0 NaN 2.0 NaN 0.0 1 3.0 4.0 NaN 1.0 2 3.0 4.0 NaN 1.0 3 3.0 3.0 NaN 4.0
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {"A": 0, "B": 1, "C": 2, "D": 3} >>> df.fillna(value=values) A B C D 0 0.0 2.0 2.0 0.0 1 3.0 4.0 2.0 1.0 2 0.0 1.0 2.0 3.0 3 0.0 3.0 2.0 4.0
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1) A B C D 0 0.0 2.0 2.0 0.0 1 3.0 4.0 NaN 1.0 2 NaN 1.0 NaN 3.0 3 NaN 3.0 NaN 4.0
When filling using a DataFrame, replacement happens along the same column names and same indices
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE")) >>> df.fillna(df2) A B C D 0 0.0 2.0 0.0 0.0 1 3.0 4.0 0.0 1.0 2 0.0 0.0 0.0 NaN 3 0.0 3.0 0.0 4.0
Note that column D is not affected since it is not present in df2.
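The downcast parameter is not exercised above; a minimal sketch (not from the upstream docstring examples; exact dtype behavior may vary by pandas version) in which 'infer' downcasts the filled float columns to integers once every value is whole:
>>> # after filling with 0, all values are integral, so 'infer' yields int64 columns
>>> df.fillna(0, downcast='infer')
   A  B  C  D
0  0  2  0  0
1  3  4  0  1
2  0  0  0  0
3  0  3  0  4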
- pop(item)[source]
Return item and drop from frame. Raise KeyError if not found.
- Parameters:
item (label) – Label of column to be popped.
- Return type:
Series
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0), ... ('parrot', 'bird', 24.0), ... ('lion', 'mammal', 80.5), ... ('monkey', 'mammal', np.nan)], ... columns=('name', 'class', 'max_speed')) >>> df name class max_speed 0 falcon bird 389.0 1 parrot bird 24.0 2 lion mammal 80.5 3 monkey mammal NaN
>>> df.pop('class') 0 bird 1 bird 2 mammal 3 mammal Name: class, dtype: object
>>> df name max_speed 0 falcon 389.0 1 parrot 24.0 2 lion 80.5 3 monkey NaN
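A quick illustration of the KeyError noted above (not from the upstream docstring examples): popping 'class' a second time fails, since the column was just removed.
>>> df.pop('class')
Traceback (most recent call last):
KeyError: 'class'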
- replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[False] = False, limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) DataFrame[source]
- replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[True], limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) None
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically.
This differs from updating with
.loc or .iloc, which require you to specify a location to update with some value.
- Parameters:
to_replace (str, regex, list, dict, Series, int, float, or None) –
How to find the values that will be replaced.
numeric, str or regex:
numeric: numeric values equal to to_replace will be replaced with value
str: string exactly matching to_replace will be replaced with value
regex: regexes matching to_replace will be replaced with value
list of str, regex, or numeric:
First, if to_replace and value are both lists, they must be the same length.
Second, if
regex=True then all of the strings in both lists will be interpreted as regexes, otherwise they will match directly. This doesn't matter much for value since there are only a few possible substitution regexes you can use.
str, regex and numeric rules apply as above.
dict:
Dicts can be used to specify different replacement values for different existing values. For example,
{'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and 'y' with 'z'. To use a dict in this way, the optional value parameter should not be given.
For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and the value 'z' in column 'b' and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column 'a' for the value 'b' and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also
None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
limit (int, default None) – Maximum size gap to forward or backward fill.
regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is
True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
method ({'pad', 'ffill', 'bfill'}) – The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.
- Returns:
Object after replacement.
- Return type:
DataFrame
- Raises:
AssertionError – If regex is not a bool and to_replace is not None.
TypeError – If to_replace is not a scalar, array-like, dict, or None.
TypeError – If to_replace is a dict and value is not a list, dict, ndarray, or Series.
TypeError – If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
TypeError – When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.
ValueError – If a list or an ndarray is passed to to_replace and value but they are not the same length.
See also
DataFrame.fillnaFill NA values.
DataFrame.whereReplace values based on boolean condition.
Series.str.replaceSimple string replacement.
Notes
Regex substitution is performed under the hood with
re.sub. The rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
When a dict is used as the to_replace value, the dict's key(s) play the role of to_replace and the dict's value(s) play the role of the value parameter.
Examples
Scalar `to_replace` and `value`
>>> s = pd.Series([1, 2, 3, 4, 5]) >>> s.replace(1, 5) 0 5 1 2 2 3 3 4 4 5 dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4], ... 'B': [5, 6, 7, 8, 9], ... 'C': ['a', 'b', 'c', 'd', 'e']}) >>> df.replace(0, 5) A B C 0 5 5 a 1 1 6 b 2 2 7 c 3 3 8 d 4 4 9 e
List-like `to_replace`
>>> df.replace([0, 1, 2, 3], 4) A B C 0 4 5 a 1 4 6 b 2 4 7 c 3 4 8 d 4 4 9 e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1]) A B C 0 4 5 a 1 3 6 b 2 2 7 c 3 1 8 d 4 4 9 e
>>> s.replace([1, 2], method='bfill') 0 3 1 3 2 3 3 4 4 5 dtype: int64
dict-like `to_replace`
>>> df.replace({0: 10, 1: 100}) A B C 0 10 5 a 1 100 6 b 2 2 7 c 3 3 8 d 4 4 9 e
>>> df.replace({'A': 0, 'B': 5}, 100) A B C 0 100 100 a 1 1 6 b 2 2 7 c 3 3 8 d 4 4 9 e
>>> df.replace({'A': {0: 100, 4: 400}}) A B C 0 100 5 a 1 1 6 b 2 2 7 c 3 3 8 d 4 400 9 e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'], ... 'B': ['abc', 'bar', 'xyz']}) >>> df.replace(to_replace=r'^ba.$', value='new', regex=True) A B 0 new abc 1 foo new 2 bait xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True) A B 0 new abc 1 foo bar 2 bait xyz
>>> df.replace(regex=r'^ba.$', value='new') A B 0 new abc 1 foo new 2 bait xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'}) A B 0 new abc 1 xyz new 2 bait xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new') A B 0 new abc 1 new new 2 bait xyz
Compare the behavior of
s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the to_replace value, the value(s) in the dict act as the value parameter.
s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):
>>> s.replace({'a': None}) 0 10 1 None 2 None 3 b 4 None dtype: object
When
value is not explicitly passed and to_replace is a scalar, list or tuple, replace uses the method parameter (default 'pad') to do the replacement. This is why the 'a' values are replaced by 10 in rows 1 and 2, and by 'b' in row 4.
>>> s.replace('a') 0 10 1 10 2 10 3 b 4 b dtype: object
On the other hand, if
None is explicitly passed for value, it will be respected:
>>> s.replace('a', None) 0 10 1 None 2 None 3 b 4 None dtype: object
Changed in version 1.4.0: Previously the explicit
None was silently ignored.
- shift(periods=1, freq=None, axis=0, fill_value=_NoDefault.no_default)[source]
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.
- Parameters:
periods (int) – Number of periods to shift. Can be positive or negative.
freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For Series this parameter is unused and defaults to 0.
fill_value (object, optional) –
The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 1.1.0.
- Returns:
Copy of input object, shifted.
- Return type:
DataFrame
See also
Index.shiftShift values of Index.
DatetimeIndex.shiftShift values of DatetimeIndex.
PeriodIndex.shiftShift values of PeriodIndex.
Examples
>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45], ... "Col2": [13, 23, 18, 33, 48], ... "Col3": [17, 27, 22, 37, 52]}, ... index=pd.date_range("2020-01-01", "2020-01-05")) >>> df Col1 Col2 Col3 2020-01-01 10 13 17 2020-01-02 20 23 27 2020-01-03 15 18 22 2020-01-04 30 33 37 2020-01-05 45 48 52
>>> df.shift(periods=3) Col1 Col2 Col3 2020-01-01 NaN NaN NaN 2020-01-02 NaN NaN NaN 2020-01-03 NaN NaN NaN 2020-01-04 10.0 13.0 17.0 2020-01-05 20.0 23.0 27.0
>>> df.shift(periods=1, axis="columns") Col1 Col2 Col3 2020-01-01 NaN 10 13 2020-01-02 NaN 20 23 2020-01-03 NaN 15 18 2020-01-04 NaN 30 33 2020-01-05 NaN 45 48
>>> df.shift(periods=3, fill_value=0) Col1 Col2 Col3 2020-01-01 0 0 0 2020-01-02 0 0 0 2020-01-03 0 0 0 2020-01-04 10 13 17 2020-01-05 20 23 27
>>> df.shift(periods=3, freq="D") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
>>> df.shift(periods=3, freq="infer") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
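Negative periods shift the data in the opposite direction; a minimal sketch (not from the upstream docstring examples) reusing the frame above:
>>> # each row takes the values of the row below it; the last row becomes NaN
>>> df.shift(periods=-1)
            Col1  Col2  Col3
2020-01-01  20.0  23.0  27.0
2020-01-02  15.0  18.0  22.0
2020-01-03  30.0  33.0  37.0
2020-01-04  45.0  48.0  52.0
2020-01-05   NaN   NaN   NaN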
- set_index(keys, *, drop: bool = True, append: bool = False, inplace: Literal[False] = False, verify_integrity: bool = False) DataFrame[source]
- set_index(keys, *, drop: bool = True, append: bool = False, inplace: Literal[True], verify_integrity: bool = False) None
Set the DataFrame index using existing columns.
Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.
- Parameters:
keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses
Series, Index, np.ndarray, and instances of Iterator.
drop (bool, default True) – Delete columns to be used as the new index.
append (bool, default False) – Whether to append columns to existing index.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verify_integrity (bool, default False) – Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
- Returns:
Changed row labels or None if
inplace=True.
- Return type:
DataFrame or None
See also
DataFrame.reset_indexOpposite of set_index.
DataFrame.reindexChange to new indices or expand indices.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
>>> df = pd.DataFrame({'month': [1, 4, 7, 10], ... 'year': [2012, 2014, 2013, 2014], ... 'sale': [55, 40, 84, 31]}) >>> df month year sale 0 1 2012 55 1 4 2014 40 2 7 2013 84 3 10 2014 31
Set the index to become the ‘month’ column:
>>> df.set_index('month') year sale month 1 2012 55 4 2014 40 7 2013 84 10 2014 31
Create a MultiIndex using columns ‘year’ and ‘month’:
>>> df.set_index(['year', 'month']) sale year month 2012 1 55 2014 4 40 2013 7 84 2014 10 31
Create a MultiIndex using an Index and a column:
>>> df.set_index([pd.Index([1, 2, 3, 4]), 'year']) month sale year 1 2012 1 55 2 2014 4 40 3 2013 7 84 4 2014 10 31
Create a MultiIndex using two Series:
>>> s = pd.Series([1, 2, 3, 4]) >>> df.set_index([s, s**2]) month year sale 1 1 1 2012 55 2 4 4 2014 40 3 9 7 2013 84 4 16 10 2014 31
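The verify_integrity parameter is not exercised above; a minimal sketch (not from the upstream docstring examples; the exact error message may vary by pandas version) using the duplicate 2014 in the 'year' column:
>>> # 'year' contains 2014 twice, so integrity verification fails
>>> df.set_index('year', verify_integrity=True)
Traceback (most recent call last):
ValueError: Index has duplicate keys: Index([2014], dtype='int64')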
- reset_index(level: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, *, drop: bool = False, inplace: ~typing.Literal[False] = False, col_level: ~typing.Hashable = 0, col_fill: ~typing.Hashable = '', allow_duplicates: bool | ~typing.Literal[<no_default>] = _NoDefault.no_default, names: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None) DataFrame[source]
- reset_index(level: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, *, drop: bool = False, inplace: ~typing.Literal[True], col_level: ~typing.Hashable = 0, col_fill: ~typing.Hashable = '', allow_duplicates: bool | ~typing.Literal[<no_default>] = _NoDefault.no_default, names: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None) None
- reset_index(level: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, *, drop: bool = False, inplace: bool = False, col_level: ~typing.Hashable = 0, col_fill: ~typing.Hashable = '', allow_duplicates: bool | ~typing.Literal[<no_default>] = _NoDefault.no_default, names: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None) DataFrame | None
Reset the index, or a level of it.
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
- Parameters:
level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.
drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
allow_duplicates (bool, optional, default lib.no_default) –
Allow duplicate column labels to be created.
New in version 1.5.0.
names (int, str or 1-dimensional list, default None) –
Using the given string, rename the DataFrame column which contains the index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.
New in version 1.5.0.
- Returns:
DataFrame with the new index or None if
inplace=True.
- Return type:
DataFrame or None
See also
DataFrame.set_indexOpposite of reset_index.
DataFrame.reindexChange to new indices or expand indices.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
>>> df = pd.DataFrame([('bird', 389.0), ... ('bird', 24.0), ... ('mammal', 80.5), ... ('mammal', np.nan)], ... index=['falcon', 'parrot', 'lion', 'monkey'], ... columns=('class', 'max_speed')) >>> df class max_speed falcon bird 389.0 parrot bird 24.0 lion mammal 80.5 monkey mammal NaN
When we reset the index, the old index is added as a column, and a new sequential index is used:
>>> df.reset_index() index class max_speed 0 falcon bird 389.0 1 parrot bird 24.0 2 lion mammal 80.5 3 monkey mammal NaN
We can use the drop parameter to avoid the old index being added as a column:
>>> df.reset_index(drop=True) class max_speed 0 bird 389.0 1 bird 24.0 2 mammal 80.5 3 mammal NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'), ... ('bird', 'parrot'), ... ('mammal', 'lion'), ... ('mammal', 'monkey')], ... names=['class', 'name']) >>> columns = pd.MultiIndex.from_tuples([('speed', 'max'), ... ('species', 'type')]) >>> df = pd.DataFrame([(389.0, 'fly'), ... (24.0, 'fly'), ... (80.5, 'run'), ... (np.nan, 'jump')], ... index=index, ... columns=columns) >>> df speed species max type class name bird falcon 389.0 fly parrot 24.0 fly mammal lion 80.5 run monkey NaN jump
Using the names parameter, choose a name for the index column:
>>> df.reset_index(names=['classes', 'names']) classes names speed species max type 0 bird falcon 389.0 fly 1 bird parrot 24.0 fly 2 mammal lion 80.5 run 3 mammal monkey NaN jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class') class speed species max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1) speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
When the index is inserted under another level, we can specify under which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species') species speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus') genus speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
- isna()[source]
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type:
DataFrame
See also
DataFrame.isnullAlias of isna.
DataFrame.notnaBoolean inverse of isna.
DataFrame.dropnaOmit axes labels with missing values.
isnaTop-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- isnull()[source]
DataFrame.isnull is an alias for DataFrame.isna.
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type:
DataFrame
See also
DataFrame.isnullAlias of isna.
DataFrame.notnaBoolean inverse of isna.
DataFrame.dropnaOmit axes labels with missing values.
isnaTop-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- notna()[source]
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
'' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type:
DataFrame
See also
DataFrame.notnullAlias of notna.
DataFrame.isnaBoolean inverse of notna.
DataFrame.dropnaOmit axes labels with missing values.
notnaTop-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- notnull()[source]
DataFrame.notnull is an alias for DataFrame.notna.
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
'' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type:
DataFrame
See also
DataFrame.notnullAlias of notna.
DataFrame.isnaBoolean inverse of notna.
DataFrame.dropnaOmit axes labels with missing values.
notnaTop-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- dropna(*, axis: int | ~typing.Literal['index', 'columns', 'rows'] = 0, how: ~typing.Literal['any', 'all'] | ~typing.Literal[<no_default>] = _NoDefault.no_default, thresh: int | ~typing.Literal[<no_default>] = _NoDefault.no_default, subset: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, inplace: ~typing.Literal[False] = False, ignore_index: bool = False) DataFrame[source]
- dropna(*, axis: int | ~typing.Literal['index', 'columns', 'rows'] = 0, how: ~typing.Literal['any', 'all'] | ~typing.Literal[<no_default>] = _NoDefault.no_default, thresh: int | ~typing.Literal[<no_default>] = _NoDefault.no_default, subset: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, inplace: ~typing.Literal[True], ignore_index: bool = False) None
Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or 'columns' : Drop columns which contain missing values.
Only a single axis is allowed; passing a tuple or list to drop on multiple axes is not supported.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
'any' : If any NA values are present, drop that row or column.
'all' : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 2.0.0.
- Returns:
DataFrame with NA entries dropped from it or None if
inplace=True.
- Return type:
DataFrame or None
See also
DataFrame.isnaIndicate missing values.
DataFrame.notnaIndicate existing (non-missing) values.
DataFrame.fillnaReplace missing values.
Series.dropnaDrop missing values.
Index.dropnaDrop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'], ... "toy": [np.nan, 'Batmobile', 'Bullwhip'], ... "born": [pd.NaT, pd.Timestamp("1940-04-25"), ... pd.NaT]}) >>> df name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna() name toy born 1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns') name 0 Alfred 1 Batman 2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all') name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2) name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy']) name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
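The ignore_index parameter is not exercised above; a minimal sketch (not from the upstream docstring examples, requires pandas >= 2.0) that drops rows with a missing toy and relabels the result 0, 1, …:
>>> # row 0 is dropped for its missing toy; the remaining rows are renumbered
>>> df.dropna(subset=['toy'], ignore_index=True)
       name        toy       born
0    Batman  Batmobile 1940-04-25
1  Catwoman   Bullwhip        NaT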
- drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)[source]
Return DataFrame with duplicate rows removed.
Considering certain columns is optional. Indexes, including time indexes, are ignored.
- Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
keep ({'first', 'last', False}, default 'first') –
'first' : Drop duplicates except for the first occurrence.
'last' : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
- Returns:
DataFrame with duplicates removed or None if
inplace=True.
- Return type:
DataFrame or None
See also
DataFrame.value_countsCount unique combinations of columns.
Examples
Consider a dataset containing ramen ratings.
>>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates() brand style rating 0 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
To remove duplicates on specific column(s), use
subset.
>>> df.drop_duplicates(subset=['brand']) brand style rating 0 Yum Yum cup 4.0 2 Indomie cup 3.5
To remove duplicates and keep last occurrences, use
keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last') brand style rating 1 Yum Yum cup 4.0 2 Indomie cup 3.5 4 Indomie pack 5.0
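keep=False is not shown above; a minimal sketch (not from the upstream docstring examples) in which every row that has a duplicate is dropped, here the two identical 'Yum Yum' rows:
>>> # rows 0 and 1 duplicate each other, so both are removed
>>> df.drop_duplicates(keep=False)
     brand style  rating
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0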
- duplicated(subset=None, keep='first')[source]
Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
- Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
keep ({'first', 'last', False}, default 'first') –
Determines which duplicates (if any) to mark.
first: Mark duplicates as True except for the first occurrence.
last: Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Returns:
Boolean Series denoting duplicated rows.
- Return type:
Series
See also
Index.duplicatedEquivalent method on index.
Series.duplicatedEquivalent method on Series.
Series.drop_duplicatesRemove duplicate values from Series.
DataFrame.drop_duplicatesRemove duplicate values from DataFrame.
Examples
Consider a dataset containing ramen ratings.
>>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence is set to False and all others to True.
>>> df.duplicated() 0 False 1 True 2 False 3 False 4 False dtype: bool
By using 'last', the last occurrence of each set of duplicated values is set to False and all others to True.
>>> df.duplicated(keep='last') 0 True 1 False 2 False 3 False 4 False dtype: bool
By setting
keep on False, all duplicates are True.
>>> df.duplicated(keep=False) 0 True 1 True 2 False 3 False 4 False dtype: bool
To find duplicates on specific column(s), use
subset.
>>> df.duplicated(subset=['brand']) 0 False 1 True 2 False 3 True 4 True dtype: bool
- sort_values(by: Hashable | Sequence[Hashable], *, axis: int | Literal['index', 'columns', 'rows'] = 0, ascending=True, inplace: Literal[False] = False, kind: str = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: Callable[[Series], Series | ExtensionArray | ndarray | Index] | None = None) DataFrame[source]
- sort_values(by: Hashable | Sequence[Hashable], *, axis: int | Literal['index', 'columns', 'rows'] = 0, ascending=True, inplace: Literal[True], kind: str = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: Callable[[Series], Series | ExtensionArray | ndarray | Index] | None = None) None
Sort by the values along either axis.
- Parameters:
by (str or list of str) – Name or list of names to sort by.
if axis is 0 or ‘index’ then by may contain index levels and/or column labels.
if axis is 1 or ‘columns’ then by may contain column levels and/or index labels.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to be sorted.
ascending (bool or list of bool, default True) – Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) –
Apply the key function to the values before sorting. This is similar to the key argument in the builtin
sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
New in version 1.1.0.
- Returns:
DataFrame with sorted values or None if
inplace=True.
- Return type:
DataFrame or None
See also
DataFrame.sort_indexSort a DataFrame by the index.
Series.sort_valuesSimilar method for a Series.
Examples
>>> df = pd.DataFrame({ ... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'], ... 'col2': [2, 1, 9, 8, 7, 4], ... 'col3': [0, 1, 9, 4, 2, 3], ... 'col4': ['a', 'B', 'c', 'D', 'e', 'F'] ... }) >>> df col1 col2 col3 col4 0 A 2 0 a 1 A 1 1 B 2 B 9 9 c 3 NaN 8 4 D 4 D 7 2 e 5 C 4 3 F
Sort by col1
>>> df.sort_values(by=['col1']) col1 col2 col3 col4 0 A 2 0 a 1 A 1 1 B 2 B 9 9 c 5 C 4 3 F 4 D 7 2 e 3 NaN 8 4 D
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2']) col1 col2 col3 col4 1 A 1 1 B 0 A 2 0 a 2 B 9 9 c 5 C 4 3 F 4 D 7 2 e 3 NaN 8 4 D
Sort Descending
>>> df.sort_values(by='col1', ascending=False) col1 col2 col3 col4 4 D 7 2 e 5 C 4 3 F 2 B 9 9 c 0 A 2 0 a 1 A 1 1 B 3 NaN 8 4 D
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first') col1 col2 col3 col4 3 NaN 8 4 D 4 D 7 2 e 5 C 4 3 F 2 B 9 9 c 0 A 2 0 a 1 A 1 1 B
Sorting with a key function
>>> df.sort_values(by='col4', key=lambda col: col.str.lower()) col1 col2 col3 col4 0 A 2 0 a 1 A 1 1 B 2 B 9 9 c 3 NaN 8 4 D 4 D 7 2 e 5 C 4 3 F
Natural sort with the key argument, using the natsort package (https://github.com/SethMMorton/natsort).
>>> df = pd.DataFrame({ ... "time": ['0hr', '128hr', '72hr', '48hr', '96hr'], ... "value": [10, 20, 30, 40, 50] ... }) >>> df time value 0 0hr 10 1 128hr 20 2 72hr 30 3 48hr 40 4 96hr 50 >>> from natsort import index_natsorted >>> df.sort_values( ... by="time", ... key=lambda x: np.argsort(index_natsorted(df["time"])) ... ) time value 0 0hr 10 3 48hr 40 2 72hr 30 4 96hr 50 1 128hr 20
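The ignore_index parameter is not exercised above; a minimal sketch (not from the upstream docstring examples) reusing the time/value frame, sorting by value descending and relabeling the index 0, 1, …:
>>> df.sort_values(by="value", ascending=False, ignore_index=True)
    time  value
0   96hr     50
1   48hr     40
2   72hr     30
3  128hr     20
4    0hr     10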
- sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: Literal[True], kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) None[source]
- sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: Literal[False] = False, kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) DataFrame
- sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: bool = False, kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) DataFrame | None
Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace argument is
False, otherwise updates the original DataFrame and returns None.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) –
If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin
sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.
New in version 1.1.0.
- Returns:
The original DataFrame sorted by the labels or None if
inplace=True.
- Return type:
DataFrame or None
See also
Series.sort_indexSort Series by the index.
DataFrame.sort_valuesSort DataFrame by the value.
Series.sort_valuesSort Series by the value.
Examples
>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], ... columns=['A']) >>> df.sort_index() A 1 4 29 2 100 1 150 5 234 3
By default, it sorts in ascending order; to sort in descending order, use ascending=False:
>>> df.sort_index(ascending=False) A 234 3 150 5 100 1 29 2 1 4
A key function can be specified which is applied to the index before sorting. For a
MultiIndex this is applied to each level separately.
>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd']) >>> df.sort_index(key=lambda x: x.str.lower()) a A 1 b 2 C 3 d 4
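The level and sort_remaining parameters are not exercised above; a minimal sketch (not from the upstream docstring examples; midx and dfm are illustrative names) sorting only on the outer level while leaving the inner level in its original order:
>>> midx = pd.MultiIndex.from_arrays([['b', 'b', 'a', 'a'], [2, 1, 2, 1]],
...                                  names=['outer', 'inner'])
>>> dfm = pd.DataFrame({'x': [1, 2, 3, 4]}, index=midx)
>>> # sort_remaining=False: 'inner' keeps its original order within each group
>>> dfm.sort_index(level='outer', sort_remaining=False)
             x
outer inner
a     2      3
      1      4
b     2      1
      1      2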
- value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)[source]
Return a Series containing counts of unique rows in the DataFrame.
New in version 1.1.0.
- Parameters:
subset (label or list of labels, optional) – Columns to use when counting unique combinations.
normalize (bool, default False) – Return proportions rather than frequencies.
sort (bool, default True) – Sort by frequencies.
ascending (bool, default False) – Sort in ascending order.
dropna (bool, default True) –
Don’t include counts of rows that contain NA values.
New in version 1.3.0.
- Return type:
Series
See also
Series.value_counts : Equivalent method on Series.
Notes
The returned Series will have a MultiIndex with one level per input column but an Index (non-multi) for a single label. By default, rows that contain any NA values are omitted from the result. By default, the resulting Series will be in descending order so that the first element is the most frequently-occurring row.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6], ... 'num_wings': [2, 0, 0, 0]}, ... index=['falcon', 'dog', 'cat', 'ant']) >>> df num_legs num_wings falcon 2 2 dog 4 0 cat 4 0 ant 6 0
>>> df.value_counts() num_legs num_wings 4 0 2 2 2 1 6 0 1 Name: count, dtype: int64
>>> df.value_counts(sort=False) num_legs num_wings 2 2 1 4 0 2 6 0 1 Name: count, dtype: int64
>>> df.value_counts(ascending=True) num_legs num_wings 2 2 1 6 0 1 4 0 2 Name: count, dtype: int64
>>> df.value_counts(normalize=True) num_legs num_wings 4 0 0.50 2 2 0.25 6 0 0.25 Name: proportion, dtype: float64
With dropna set to False we can also count rows with NA values.
>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'], ... 'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']}) >>> df first_name middle_name 0 John Smith 1 Anne <NA> 2 John <NA> 3 Beth Louise
>>> df.value_counts() first_name middle_name Beth Louise 1 John Smith 1 Name: count, dtype: int64
>>> df.value_counts(dropna=False) first_name middle_name Anne NaN 1 Beth Louise 1 John Smith 1 NaN 1 Name: count, dtype: int64
>>> df.value_counts("first_name") first_name John 2 Anne 1 Beth 1 Name: count, dtype: int64
- nlargest(n, columns, keep='first')[source]
Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.
- Parameters:
n (int) – Number of rows to return.
columns (label or list of labels) – Column label(s) to order by.
keep ({'first', 'last', 'all'}, default 'first') –
Where there are duplicate values:
first: prioritize the first occurrence(s)
last: prioritize the last occurrence(s)
all: do not drop any duplicates, even if it means selecting more than n items.
- Returns:
The first n rows ordered by the given columns in descending order.
- Return type:
DataFrame
See also
DataFrame.nsmallest : Return the first n rows ordered by columns in ascending order.
DataFrame.sort_values : Sort DataFrame by the values.
DataFrame.head : Return the first n rows without re-ordering.
Notes
This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.
Examples
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, ... 434000, 434000, 337000, 11300, ... 11300, 11300], ... 'GDP': [1937894, 2583560 , 12011, 4520, 12128, ... 17036, 182, 38, 311], ... 'alpha-2': ["IT", "FR", "MT", "MV", "BN", ... "IS", "NR", "TV", "AI"]}, ... index=["Italy", "France", "Malta", ... "Maldives", "Brunei", "Iceland", ... "Nauru", "Tuvalu", "Anguilla"]) >>> df population GDP alpha-2 Italy 59000000 1937894 IT France 65000000 2583560 FR Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN Iceland 337000 17036 IS Nauru 11300 182 NR Tuvalu 11300 38 TV Anguilla 11300 311 AI
In the following example, we will use nlargest to select the three rows having the largest values in column “population”.
>>> df.nlargest(3, 'population') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT
When using keep='last', ties are resolved in reverse order:
>>> df.nlargest(3, 'population', keep='last') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Brunei 434000 12128 BN
When using keep='all', all duplicate items are maintained:
>>> df.nlargest(3, 'population', keep='all') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN
To order by the largest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.
>>> df.nlargest(3, ['population', 'GDP']) population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Brunei 434000 12128 BN
- nsmallest(n, columns, keep='first')[source]
Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.
- Parameters:
n (int) – Number of items to retrieve.
columns (label or list of labels) – Column name or names to order by.
keep ({'first', 'last', 'all'}, default 'first') –
Where there are duplicate values:
first: take the first occurrence.
last: take the last occurrence.
all: do not drop any duplicates, even if it means selecting more than n items.
- Return type:
DataFrame
See also
DataFrame.nlargest : Return the first n rows ordered by columns in descending order.
DataFrame.sort_values : Sort DataFrame by the values.
DataFrame.head : Return the first n rows without re-ordering.
Examples
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, ... 434000, 434000, 337000, 337000, ... 11300, 11300], ... 'GDP': [1937894, 2583560 , 12011, 4520, 12128, ... 17036, 182, 38, 311], ... 'alpha-2': ["IT", "FR", "MT", "MV", "BN", ... "IS", "NR", "TV", "AI"]}, ... index=["Italy", "France", "Malta", ... "Maldives", "Brunei", "Iceland", ... "Nauru", "Tuvalu", "Anguilla"]) >>> df population GDP alpha-2 Italy 59000000 1937894 IT France 65000000 2583560 FR Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN Iceland 337000 17036 IS Nauru 337000 182 NR Tuvalu 11300 38 TV Anguilla 11300 311 AI
In the following example, we will use nsmallest to select the three rows having the smallest values in column “population”.
>>> df.nsmallest(3, 'population') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS
When using keep='last', ties are resolved in reverse order:
>>> df.nsmallest(3, 'population', keep='last') population GDP alpha-2 Anguilla 11300 311 AI Tuvalu 11300 38 TV Nauru 337000 182 NR
When using keep='all', all duplicate items are maintained:
>>> df.nsmallest(3, 'population', keep='all') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS Nauru 337000 182 NR
To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.
>>> df.nsmallest(3, ['population', 'GDP']) population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Nauru 337000 182 NR
- swaplevel(i=-2, j=-1, axis=0)[source]
Swap levels i and j in a MultiIndex.
Default is to swap the two innermost levels of the index.
- Parameters:
i (int or str) – Levels of the indices to be swapped. Can pass level name as string.
j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- Returns:
DataFrame with levels swapped in MultiIndex.
- Return type:
DataFrame
Examples
>>> df = pd.DataFrame( ... {"Grade": ["A", "B", "A", "C"]}, ... index=[ ... ["Final exam", "Final exam", "Coursework", "Coursework"], ... ["History", "Geography", "History", "Geography"], ... ["January", "February", "March", "April"], ... ], ... ) >>> df Grade Final exam History January A Geography February B Coursework History March A Geography April C
In the following example, we will swap the levels of the index. Here, we swap the levels row-wise, but levels can be swapped column-wise in a similar manner. Note that row-wise (axis=0) is the default behaviour. By not supplying any arguments for i and j, we swap the last and second-to-last levels.
>>> df.swaplevel() Grade Final exam January History A February Geography B Coursework March History A April Geography C
By supplying one argument, we can choose which index to swap the last index with. We can for example swap the first index with the last one as follows.
>>> df.swaplevel(0) Grade January History Final exam A February Geography Final exam B March History Coursework A April Geography Coursework C
We can also define explicitly which indices we want to swap by supplying values for both i and j. Here, we for example swap the first and second indices.
>>> df.swaplevel(0, 1) Grade History Final exam January A Geography Final exam February B History Coursework March A Geography Coursework April C
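The axis keyword is not shown above; a minimal sketch with hypothetical data that instead swaps the levels of a column MultiIndex:
>>> cols = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> df = pd.DataFrame([[1.0, 2.0]], columns=cols)
>>> list(df.swaplevel(axis=1).columns)
[('kg', 'weight'), ('m', 'height')]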
- reorder_levels(order, axis=0)[source]
Rearrange index levels using input order. May not drop or duplicate levels.
- Parameters:
order (list of int or list of str) – List representing the new level order. Levels may be referenced by number (position) or by key (label).
axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
- Return type:
DataFrame
Examples
>>> data = { ... "class": ["Mammals", "Mammals", "Reptiles"], ... "diet": ["Omnivore", "Carnivore", "Carnivore"], ... "species": ["Humans", "Dogs", "Snakes"], ... } >>> df = pd.DataFrame(data, columns=["class", "diet", "species"]) >>> df = df.set_index(["class", "diet"]) >>> df species class diet Mammals Omnivore Humans Carnivore Dogs Reptiles Carnivore Snakes
Let’s reorder the levels of the index:
>>> df.reorder_levels(["diet", "class"]) species diet class Omnivore Mammals Humans Carnivore Mammals Dogs Reptiles Snakes
- compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))[source]
Compare to another DataFrame and show the differences.
New in version 1.1.0.
- Parameters:
other (DataFrame) – Object to compare with.
align_axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine which axis to align the comparison on.
- 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.
- 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.
keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
result_names (tuple, default ('self', 'other')) –
Set the dataframes names in the comparison.
New in version 1.5.0.
- Returns:
DataFrame that shows the differences stacked side by side.
The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
- Return type:
DataFrame
- Raises:
ValueError – When the two DataFrames don’t have identical labels or shape.
See also
Series.compare : Compare with another Series and show differences.
DataFrame.equals : Test whether two objects contain the same elements.
Notes
Matching NaNs will not appear as a difference.
Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames.
Examples
>>> df = pd.DataFrame( ... { ... "col1": ["a", "a", "b", "b", "a"], ... "col2": [1.0, 2.0, 3.0, np.nan, 5.0], ... "col3": [1.0, 2.0, 3.0, 4.0, 5.0] ... }, ... columns=["col1", "col2", "col3"], ... ) >>> df col1 col2 col3 0 a 1.0 1.0 1 a 2.0 2.0 2 b 3.0 3.0 3 b NaN 4.0 4 a 5.0 5.0
>>> df2 = df.copy() >>> df2.loc[0, 'col1'] = 'c' >>> df2.loc[2, 'col3'] = 4.0 >>> df2 col1 col2 col3 0 c 1.0 1.0 1 a 2.0 2.0 2 b 3.0 4.0 3 b NaN 4.0 4 a 5.0 5.0
Align the differences on columns
>>> df.compare(df2) col1 col3 self other self other 0 a c NaN NaN 2 NaN NaN 3.0 4.0
Assign result_names
>>> df.compare(df2, result_names=("left", "right")) col1 col3 left right left right 0 a c NaN NaN 2 NaN NaN 3.0 4.0
Stack the differences on rows
>>> df.compare(df2, align_axis=0) col1 col3 0 self a NaN other c NaN 2 self NaN 3.0 other NaN 4.0
Keep the equal values
>>> df.compare(df2, keep_equal=True) col1 col3 self other self other 0 a c 1.0 1.0 2 b b 3.0 4.0
Keep all original rows and columns
>>> df.compare(df2, keep_shape=True) col1 col2 col3 self other self other self other 0 a c NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN 3.0 4.0 3 NaN NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN NaN
Keep all original rows and columns and also all original values
>>> df.compare(df2, keep_shape=True, keep_equal=True) col1 col2 col3 self other self other self other 0 a c 1.0 1.0 1.0 1.0 1 a a 2.0 2.0 2.0 2.0 2 b b 3.0 3.0 3.0 4.0 3 b b NaN NaN 4.0 4.0 4 a a 5.0 5.0 5.0 5.0
- combine(other, func, fill_value=None, overwrite=True)[source]
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
- Parameters:
other (DataFrame) – The DataFrame to merge column-wise.
func (function) – Function that takes two Series as inputs and returns a Series or a scalar. Used to merge the two dataframes column by column.
fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.
overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.
- Returns:
Combination of the provided DataFrames.
- Return type:
DataFrame
See also
DataFrame.combine_first : Combine two DataFrame objects and default to non-null values in frame calling the method.
Examples
Combine using a simple function that chooses the smaller column.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2 >>> df1.combine(df2, take_smaller) A B 0 0 3 1 0 3
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine(df2, np.minimum) A B 0 1 2 1 0 3
Using fill_value fills Nones prior to passing the column to the merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine(df2, take_smaller, fill_value=-5) A B 0 0 -5.0 1 0 4.0
However, if the same element in both dataframes is None, that None is preserved
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]}) >>> df1.combine(df2, take_smaller, fill_value=-5) A B 0 0 -5.0 1 0 3.0
Example that demonstrates the use of overwrite and behavior when the axis differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2]) >>> df1.combine(df2, take_smaller) A B C 0 NaN NaN NaN 1 NaN 3.0 -10.0 2 NaN 3.0 1.0
>>> df1.combine(df2, take_smaller, overwrite=False) A B C 0 0.0 NaN NaN 1 0.0 3.0 -10.0 2 NaN 3.0 1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2]) >>> df2.combine(df1, take_smaller) A B C 0 0.0 NaN NaN 1 0.0 3.0 NaN 2 NaN 3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False) A B C 0 0.0 NaN NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
- combine_first(other)[source]
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from the other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two. Upon calling first.combine_first(second), the result keeps the values of first and falls back to second only where first.loc[index, col] is a missing value.
- Parameters:
other (DataFrame) – Provided DataFrame to use to fill null values.
- Returns:
The result of combining the provided DataFrame with the other object.
- Return type:
DataFrame
See also
DataFrame.combine : Perform series-wise operation on two DataFrames using a given function.
Examples
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine_first(df2) A B 0 1.0 3.0 1 0.0 4.0
Null values still persist if the location of that null value does not exist in other
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]}) >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2]) >>> df1.combine_first(df2) A B C 0 NaN 4.0 NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
- update(other, join='left', overwrite=True, filter_func=None, errors='ignore')[source]
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
- Parameters:
other (DataFrame, or object coercible into a DataFrame) – Should have at least one matching index/column label with the original DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name to align with the original DataFrame.
join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the original object.
overwrite (bool, default True) –
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values that should be updated.
errors ({'raise', 'ignore'}, default 'ignore') – If ‘raise’, will raise a ValueError if the DataFrame and other both contain non-NA data in the same place.
- Returns:
This method directly changes the calling object.
- Return type:
None
- Raises:
ValueError – When errors='raise' and there's overlapping non-NA data, or when errors is not either 'ignore' or 'raise'.
NotImplementedError – If join != 'left'.
See also
dict.update : Similar method for dictionaries.
DataFrame.merge : For column(s)-on-column(s) operations.
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3], ... 'B': [400, 500, 600]}) >>> new_df = pd.DataFrame({'B': [4, 5, 6], ... 'C': [7, 8, 9]}) >>> df.update(new_df) >>> df A B 0 1 4 1 2 5 2 3 6
The DataFrame’s length does not increase as a result of the update, only values at matching index/column labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], ... 'B': ['x', 'y', 'z']}) >>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']}) >>> df.update(new_df) >>> df A B 0 a d 1 b e 2 c f
For Series, its name attribute must be set.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], ... 'B': ['x', 'y', 'z']}) >>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2]) >>> df.update(new_column) >>> df A B 0 a d 1 b y 2 c e >>> df = pd.DataFrame({'A': ['a', 'b', 'c'], ... 'B': ['x', 'y', 'z']}) >>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2]) >>> df.update(new_df) >>> df A B 0 a x 1 b d 2 c e
If other contains NaNs the corresponding values are not updated in the original dataframe.
>>> df = pd.DataFrame({'A': [1, 2, 3], ... 'B': [400, 500, 600]}) >>> new_df = pd.DataFrame({'B': [4, np.nan, 6]}) >>> df.update(new_df) >>> df A B 0 1 4 1 2 500 2 3 6
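The filter_func parameter is not exemplified above; here is a minimal sketch with hypothetical data that only overwrites values below a threshold (filter_func receives the original values and returns True where an update is allowed):
>>> df = pd.DataFrame({'A': [1, 200, 3]})
>>> new_df = pd.DataFrame({'A': [10, 20, 30]})
>>> df.update(new_df, filter_func=lambda v: v < 100)
>>> df
     A
0   10
1  200
2   30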
- groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)[source]
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
- Parameters:
by (mapping, function, label, pd.Grouper or list of such) – Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see the .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.
level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.
as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort (bool, default True) –
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no longer sort the values.
group_keys (bool, default True) – When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.
Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed Series or DataFrame. Specify group_keys explicitly to include the group keys or not.
Changed in version 2.0.0: group_keys now defaults to True.
observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
dropna (bool, default True) –
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
New in version 1.1.0.
- Returns:
Returns a groupby object that contains information about the groups.
- Return type:
DataFrameGroupBy
See also
resample : Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
Examples
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0 >>> df.groupby(['Animal']).mean() Max Speed Animal Falcon 375.0 Parrot 25.0
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) >>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, ... index=index) >>> df Max Speed Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 >>> df.groupby(level=0).mean() Max Speed Animal Falcon 370.0 Parrot 25.0 >>> df.groupby(level="Type").mean() Max Speed Type Captive 210.0 Wild 185.0
We can also choose to include NA in group keys or not by setting the dropna parameter; the default setting is True.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum() a c b 1.0 2 3 2.0 2 5
>>> df.groupby(by=["b"], dropna=False).sum() a c b 1.0 2 3 2.0 2 5 NaN 1 4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum() b c a a 13.0 13.0 b 12.3 123.0
>>> df.groupby(by="a", dropna=False).sum() b c a a 13.0 13.0 b 12.3 123.0 NaN 12.3 33.0
When using .apply(), use group_keys to include or exclude the group keys. The group_keys argument defaults to True (include).
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df.groupby("Animal", group_keys=True).apply(lambda x: x) Animal Max Speed Animal Falcon 0 Falcon 380.0 1 Falcon 370.0 Parrot 2 Parrot 24.0 3 Parrot 26.0
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x) Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0
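The observed parameter is easiest to see with a Categorical grouper; a minimal sketch with hypothetical data, where the unobserved category 'c' is kept (with an empty aggregate) only when observed=False:
>>> cats = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])
>>> df = pd.DataFrame({'key': cats, 'val': [1, 2, 3]})
>>> df.groupby('key', observed=False)['val'].sum()
key
a    3
b    3
c    0
Name: val, dtype: int64
>>> df.groupby('key', observed=True)['val'].sum()
key
a    3
b    3
Name: val, dtype: int64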
- pivot(*, columns, index=<no_default>, values=<no_default>)[source]
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns. See the User Guide for more on reshaping.
- Parameters:
columns (str or object or a list of str) –
Column to use to make new frame’s columns.
Changed in version 1.1.0: Also accept list of columns names.
index (str or object or a list of str, optional) –
Column to use to make new frame’s index. If not given, uses existing index.
Changed in version 1.1.0: Also accept list of index names.
values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
- Returns:
Returns reshaped DataFrame.
- Return type:
DataFrame
- Raises:
ValueError – When there are any index, columns combinations with multiple values. Use DataFrame.pivot_table when you need to aggregate.
See also
DataFrame.pivot_table : Generalization of pivot that can handle duplicate values for one index/column pair.
DataFrame.unstack : Pivot based on the index values instead of a column.
wide_to_long : Wide panel to long format. Less flexible but more user-friendly than melt.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', ... 'two'], ... 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], ... 'baz': [1, 2, 3, 4, 5, 6], ... 'zoo': ['x', 'y', 'z', 'q', 'w', 't']}) >>> df foo bar baz zoo 0 one A 1 x 1 one B 2 y 2 one C 3 z 3 two A 4 q 4 two B 5 w 5 two C 6 t
>>> df.pivot(index='foo', columns='bar', values='baz') bar A B C foo one 1 2 3 two 4 5 6
>>> df.pivot(index='foo', columns='bar')['baz'] bar A B C foo one 1 2 3 two 4 5 6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo']) baz zoo bar A B C A B C foo one 1 2 3 x y z two 4 5 6 q w t
You could also assign a list of column names or a list of index names.
>>> df = pd.DataFrame({ ... "lev1": [1, 1, 1, 2, 2, 2], ... "lev2": [1, 1, 2, 1, 1, 2], ... "lev3": [1, 2, 1, 2, 1, 2], ... "lev4": [1, 2, 3, 4, 5, 6], ... "values": [0, 1, 2, 3, 4, 5]}) >>> df lev1 lev2 lev3 lev4 values 0 1 1 1 1 0 1 1 1 2 2 1 2 1 2 1 3 2 3 2 1 2 4 3 4 2 1 1 5 4 5 2 2 2 6 5
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values") lev2 1 2 lev3 1 2 1 2 lev1 1 0.0 1.0 2.0 NaN 2 4.0 3.0 NaN 5.0
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values") lev3 1 2 lev1 lev2 1 1 0.0 1.0 2 2.0 NaN 2 1 4.0 3.0 2 NaN 5.0
A ValueError is raised if there are any duplicates.
>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'], ... "bar": ['A', 'A', 'B', 'C'], ... "baz": [1, 2, 3, 4]}) >>> df foo bar baz 0 one A 1 1 one A 2 2 two B 3 3 two C 4
Notice that the first two rows are the same for our index and columns arguments.
>>> df.pivot(index='foo', columns='bar', values='baz') Traceback (most recent call last): ... ValueError: Index contains duplicate entries, cannot reshape
- pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)[source]
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
- Parameters:
values (list-like or scalar, optional) – Column or columns to aggregate.
index (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
columns (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
aggfunc (function, list of functions, dict, default numpy.mean) – If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions. If margins=True, aggfunc will be used to calculate the partial aggregates.
fill_value (scalar, default None) – Value to replace missing values with (in the resulting pivot table, after aggregation).
margins (bool, default False) – If margins=True, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns.
dropna (bool, default True) – Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.
margins_name (str, default 'All') – Name of the row / column that will contain the totals when margins is True.
observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
sort (bool, default True) –
Specifies if the result should be sorted.
New in version 1.3.0.
- Returns:
An Excel style pivot table.
- Return type:
DataFrame
See also
DataFrame.pivot : Pivot without aggregation that can handle non-numeric data.
DataFrame.melt : Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
wide_to_long : Wide panel to long format. Less flexible but more user-friendly than melt.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", ... "bar", "bar", "bar", "bar"], ... "B": ["one", "one", "one", "two", "two", ... "one", "one", "two", "two"], ... "C": ["small", "large", "large", "small", ... "small", "large", "small", "small", ... "large"], ... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7], ... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]}) >>> df A B C D E 0 foo one small 1 2 1 foo one large 2 4 2 foo one large 2 5 3 foo two small 3 5 4 foo two small 3 6 5 bar one large 4 6 6 bar one small 5 8 7 bar two small 6 9 8 bar two large 7 9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum) >>> table C large small A B bar one 4.0 5.0 two 7.0 6.0 foo one 4.0 1.0 two NaN 6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum, fill_value=0) >>> table C large small A B bar one 4 5 two 7 6 foo one 4 1 two 0 6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': np.mean, 'E': np.mean}) >>> table D E A C bar large 5.500000 7.500000 small 5.500000 8.500000 foo large 2.000000 4.500000 small 2.333333 4.333333
We can also calculate multiple types of aggregations for any given value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': np.mean, ... 'E': [min, max, np.mean]}) >>> table D E mean max mean min A C bar large 5.500000 9 7.500000 6 small 5.500000 9 8.500000 8 foo large 2.000000 5 4.500000 4 small 2.333333 6 4.333333 2
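The margins option described above is not exemplified; here is a short sketch reusing df from these examples (column spacing approximate):
>>> pd.pivot_table(df, values='D', index=['A'], columns=['C'],
...                aggfunc=np.sum, margins=True)
C    large  small  All
A
bar     11     11   22
foo      4      7   11
All     15     18   33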
- stack(level=-1, dropna=True)[source]
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:
if the columns have a single level, the output is a Series;
if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.
- Parameters:
level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.
dropna (bool, default True) – Whether to drop rows in the resulting Frame/Series with missing values. Stacking a column level onto the index axis can create combinations of index and column values that are missing from the original dataframe. See Examples section.
- Returns:
Stacked dataframe or series.
- Return type:
DataFrame or Series
See also
DataFrame.unstack : Unstack prescribed level(s) from index axis onto column axis.
DataFrame.pivot : Reshape dataframe from long format to wide format.
DataFrame.pivot_table : Create a spreadsheet-style pivot table as a DataFrame.
Notes
The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).
Reference the user guide for more examples.
Examples
Single level columns
>>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]], ... index=['cat', 'dog'], ... columns=['weight', 'height'])
Stacking a dataframe with a single level column axis returns a Series:
>>> df_single_level_cols weight height cat 0 1 dog 2 3 >>> df_single_level_cols.stack() cat weight 0 height 1 dog weight 2 height 3 dtype: int64
Multi level columns: simple case
>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('weight', 'pounds')]) >>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]], ... index=['cat', 'dog'], ... columns=multicol1)
Stacking a dataframe with a multi-level column axis:
>>> df_multi_level_cols1 weight kg pounds cat 1 2 dog 2 4 >>> df_multi_level_cols1.stack() weight cat kg 1 pounds 2 dog kg 2 pounds 4
Missing values
>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('height', 'm')]) >>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], ... index=['cat', 'dog'], ... columns=multicol2)
It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NaNs:
>>> df_multi_level_cols2 weight height kg m cat 1.0 2.0 dog 3.0 4.0 >>> df_multi_level_cols2.stack() height weight cat kg NaN 1.0 m 2.0 NaN dog kg NaN 3.0 m 4.0 NaN
Prescribing the level(s) to be stacked
The first parameter controls which level or levels are stacked:
>>> df_multi_level_cols2.stack(0) kg m cat height NaN 2.0 weight 1.0 NaN dog height NaN 4.0 weight 3.0 NaN >>> df_multi_level_cols2.stack([0, 1]) cat height m 2.0 weight kg 1.0 dog height m 4.0 weight kg 3.0 dtype: float64
Dropping missing values
>>> df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]], ... index=['cat', 'dog'], ... columns=multicol2)
Note that rows where all values are missing are dropped by default but this behaviour can be controlled via the dropna keyword parameter:
>>> df_multi_level_cols3 weight height kg m cat NaN 1.0 dog 2.0 3.0 >>> df_multi_level_cols3.stack(dropna=False) height weight cat kg NaN NaN m 1.0 NaN dog kg NaN 2.0 m 3.0 NaN >>> df_multi_level_cols3.stack(dropna=True) height weight cat m 1.0 NaN dog kg NaN 2.0 m 3.0 NaN
- explode(column, ignore_index=False)[source]
Transform each element of a list-like to a row, replicating index values.
- Parameters:
column (IndexLabel) – Column(s) to explode. For multiple columns, specify a non-empty list in which each element is a str or tuple, and the list-like values in all specified columns must have matching lengths within each row of the frame.
New in version 1.3.0: Multi-column explode
ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
- Returns:
Exploded lists to rows of the subset columns; index will be duplicated for these rows.
- Return type:
DataFrame
- Raises:
ValueError –
If columns of the frame are not unique.
If the specified columns to explode is an empty list.
If the specified columns to explode do not have a matching count of elements row-wise in the frame.
See also
DataFrame.unstack : Pivot a level of the (necessarily hierarchical) index labels.
DataFrame.melt : Unpivot a DataFrame from wide format to long format.
Series.explode : Explode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]], ... 'B': 1, ... 'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]}) >>> df A B C 0 [0, 1, 2] 1 [a, b, c] 1 foo 1 NaN 2 [] 1 [] 3 [3, 4] 1 [d, e]
Single-column explode.
>>> df.explode('A') A B C 0 0 1 [a, b, c] 0 1 1 [a, b, c] 0 2 1 [a, b, c] 1 foo 1 NaN 2 NaN 1 [] 3 3 1 [d, e] 3 4 1 [d, e]
Multi-column explode.
>>> df.explode(list('AC')) A B C 0 0 1 a 0 1 1 b 0 2 1 c 1 foo 1 NaN 2 NaN 1 NaN 3 3 1 d 3 4 1 e
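With ignore_index=True the duplicated index labels above are replaced by a fresh RangeIndex; a minimal sketch reusing the same df (spacing approximate):
>>> df.explode('A', ignore_index=True)
     A  B          C
0    0  1  [a, b, c]
1    1  1  [a, b, c]
2    2  1  [a, b, c]
3  foo  1        NaN
4  NaN  1         []
5    3  1     [d, e]
6    4  1     [d, e]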
- unstack(level=-1, fill_value=None)[source]
Pivot a level of the (necessarily hierarchical) index labels.
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.
If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex).
- Parameters:
level (int, str, or list of these, default -1 (last level)) – Level(s) of index to unstack, can pass level name.
fill_value (int, str or dict) – Replace NaN with this value if the unstack produces missing values.
- Return type:
Series or DataFrame
See also
DataFrame.pivot : Pivot a table based on column values.
DataFrame.stack : Pivot a level of the column labels (inverse operation from unstack).
Notes
Reference the user guide for more examples.
Examples
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ... ('two', 'a'), ('two', 'b')]) >>> s = pd.Series(np.arange(1.0, 5.0), index=index) >>> s one a 1.0 b 2.0 two a 3.0 b 4.0 dtype: float64
>>> s.unstack(level=-1) a b one 1.0 2.0 two 3.0 4.0
>>> s.unstack(level=0) one two a 1.0 3.0 b 2.0 4.0
>>> df = s.unstack(level=0) >>> df.unstack() one a 1.0 b 2.0 two a 3.0 b 4.0 dtype: float64
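The fill_value parameter replaces the holes an unstack can produce; a minimal sketch with hypothetical data on an unbalanced MultiIndex:
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a')])
>>> s = pd.Series([1.0, 2.0, 3.0], index=index)
>>> s.unstack(fill_value=0)
       a    b
one  1.0  2.0
two  3.0  0.0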
- melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)[source]
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name (scalar) – Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name (scalar, default 'value') – Name to use for the ‘value’ column.
col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.
ignore_index (bool, default True) –
If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.
New in version 1.1.0.
- Returns:
Unpivoted DataFrame.
- Return type:
DataFrame
See also
melt : Identical method.
pivot_table : Create a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivot : Return reshaped DataFrame organized by given index / column values.
DataFrame.explode : Explode a DataFrame from list-like columns to long format.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'}, ... 'B': {0: 1, 1: 3, 2: 5}, ... 'C': {0: 2, 1: 4, 2: 6}}) >>> df A B C 0 a 1 2 1 b 3 4 2 c 5 6
>>> df.melt(id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5
>>> df.melt(id_vars=['A'], value_vars=['B', 'C']) A variable value 0 a B 1 1 b B 3 2 c B 5 3 a C 2 4 b C 4 5 c C 6
The names of ‘variable’ and ‘value’ columns can be customized:
>>> df.melt(id_vars=['A'], value_vars=['B'], ... var_name='myVarname', value_name='myValname') A myVarname myValname 0 a B 1 1 b B 3 2 c B 5
Original index values can be kept around:
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False) A variable value 0 a B 1 1 b B 3 2 c B 5 0 a C 2 1 b C 4 2 c C 6
If you have multi-index columns:
>>> df.columns = [list('ABC'), list('DEF')] >>> df A B C D E F 0 a 1 2 1 b 3 4 2 c 5 6
>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5
>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')]) (A, D) variable_0 variable_1 value 0 a B E 1 1 b B E 3 2 c B E 5
- diff(periods=1, axis=0)[source]
First discrete difference of element.
Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is element in previous row).
- Parameters:
periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Take difference over rows (0) or columns (1).
- Returns:
First differences of the DataFrame.
- Return type:
DataFrame
See also
DataFrame.pct_change : Percent change over given number of periods.
DataFrame.shift : Shift index by desired number of periods with an optional time freq.
Series.diff : First discrete difference of object.
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to the current dtype in the DataFrame, however the dtype of the result is always float64.
Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], ... 'b': [1, 1, 2, 3, 5, 8], ... 'c': [1, 4, 9, 16, 25, 36]}) >>> df a b c 0 1 1 1 1 2 1 4 2 3 2 9 3 4 3 16 4 5 5 25 5 6 8 36
>>> df.diff() a b c 0 NaN NaN NaN 1 1.0 0.0 3.0 2 1.0 1.0 5.0 3 1.0 1.0 7.0 4 1.0 2.0 9.0 5 1.0 3.0 11.0
Difference with previous column
>>> df.diff(axis=1) a b c 0 NaN 0 0 1 NaN -1 3 2 NaN -1 7 3 NaN -1 13 4 NaN 0 20 5 NaN 2 28
Difference with 3rd previous row
>>> df.diff(periods=3) a b c 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 3.0 2.0 15.0 4 3.0 4.0 21.0 5 3.0 6.0 27.0
Difference with following row
>>> df.diff(periods=-1) a b c 0 -1.0 0.0 -3.0 1 -1.0 -1.0 -5.0 2 -1.0 -1.0 -7.0 3 -1.0 -2.0 -9.0 4 -1.0 -3.0 -11.0 5 NaN NaN NaN
Overflow in input dtype
>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8) >>> df.diff() a 0 NaN 1 255.0
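A quick sketch of the boolean behaviour noted above: consecutive values are combined with xor, so equal neighbours diff to False and unequal ones to True:
>>> df = pd.DataFrame({'a': [True, True, False]})
>>> df.diff()
       a
0    NaN
1  False
2   True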
- aggregate(func=None, axis=0, *args, **kwargs)[source]
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
scalar, Series or DataFrame – The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
agg is an alias for aggregate. Use the alias.
See also
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
core.groupby.GroupBy : Perform operations over groups.
core.resample.Resampler : Perform operations over resampled bins.
core.window.Rolling : Perform operations over rolling window.
core.window.Expanding : Perform operations over expanding window.
core.window.ExponentialMovingWindow : Perform operation over exponential weighted window.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3], ... [4, 5, 6], ... [7, 8, 9], ... [np.nan, np.nan, np.nan]], ... columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min']) A B C sum 12.0 15.0 18.0 min 1.0 2.0 3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}) A B sum 12.0 NaN min 1.0 2.0 max NaN 8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean)) A B C x 7.0 NaN NaN y NaN 2.0 NaN z NaN NaN 6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns") 0 2.0 1 5.0 2 8.0 3 NaN dtype: float64
- agg(func=None, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis. agg is an alias for aggregate; the parameters, notes, and examples are identical to DataFrame.aggregate above.
- any(*, axis: int | Literal['index', 'columns', 'rows'] = 0, bool_only: bool | None = None, skipna: bool = True, level: None = ..., **kwargs) Series
- any(*, axis: int | Literal['index', 'columns', 'rows'] = 0, bool_only: bool | None = None, skipna: bool = True, level: Hashable, **kwargs) DataFrame | Series
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, then DataFrame is returned; otherwise, Series is returned.
- Return type:
Series or DataFrame
See also
numpy.any : Numpy version of this method.
Series.any : Return whether any element is True.
Series.all : Return whether all elements are True.
DataFrame.any : Return whether any element is True over requested axis.
DataFrame.all : Return whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any() False >>> pd.Series([True, False]).any() True >>> pd.Series([], dtype="float64").any() False >>> pd.Series([np.nan]).any() False >>> pd.Series([np.nan]).any(skipna=False) True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) >>> df A B C 0 1 0 0 1 2 2 0
>>> df.any() A True B True C False dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) >>> df A B 0 True 1 1 False 2
>>> df.any(axis='columns') 0 True 1 True dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) >>> df A B 0 True 1 1 False 0
>>> df.any(axis='columns') 0 True 1 False dtype: bool
Aggregating over the entire DataFrame with axis=None.
>>> df.any(axis=None) True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any() Series([], dtype: bool)
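The bool_only parameter restricts the reduction to boolean columns; a minimal sketch with hypothetical mixed-dtype data:
>>> df = pd.DataFrame({"A": [True, False], "B": [1.0, 0.0]})
>>> df.any(bool_only=True)
A    True
dtype: bool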
- transform(func, axis=0, *args, **kwargs)[source]
Call func on self, producing a DataFrame with the same axis shape as self.
- Parameters:
func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g. [np.exp, 'sqrt']
dict-like of axis labels -> functions, function names or list-like of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
A DataFrame that must have the same length as self.
- Return type:
DataFrame
- Raises:
ValueError – If the returned DataFrame has a different length than self.
See also
DataFrame.agg : Only perform aggregating type operations.
DataFrame.apply : Invoke function on a DataFrame.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)}) >>> df A B 0 0 1 1 1 2 2 2 3 >>> df.transform(lambda x: x + 1) A B 0 1 2 1 2 3 2 3 4
Even though the resulting DataFrame must have the same length as the input DataFrame, it is possible to provide several input functions:
>>> s = pd.Series(range(3)) >>> s 0 0 1 1 2 2 dtype: int64 >>> s.transform([np.sqrt, np.exp]) sqrt exp 0 0.000000 1.000000 1 1.000000 2.718282 2 1.414214 7.389056
You can call transform on a GroupBy object:
>>> df = pd.DataFrame({ ... "Date": [ ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05", ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"], ... "Data": [5, 8, 6, 1, 50, 100, 60, 120], ... }) >>> df Date Data 0 2015-05-08 5 1 2015-05-07 8 2 2015-05-06 6 3 2015-05-05 1 4 2015-05-08 50 5 2015-05-07 100 6 2015-05-06 60 7 2015-05-05 120 >>> df.groupby('Date')['Data'].transform('sum') 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data, dtype: int64
>>> df = pd.DataFrame({ ... "c": [1, 1, 1, 2, 2, 2, 2], ... "type": ["m", "n", "o", "m", "m", "n", "n"] ... }) >>> df c type 0 1 m 1 1 n 2 1 o 3 2 m 4 2 m 5 2 n 6 2 n >>> df['size'] = df.groupby('c')['type'].transform(len) >>> df c type size 0 1 m 3 1 1 n 3 2 1 o 3 3 2 m 4 4 2 m 4 5 2 n 4 6 2 n 4
- apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)[source]
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
- Parameters:
func (function) – Function to apply to each column or row.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
raw (bool, default False) –
Determines if row or column is passed as a Series or ndarray object:
False : passes each row or column as a Series to the function.
True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
result_type ({'expand', 'reduce', 'broadcast', None}, default None) –
These only act when axis=1 (columns):
’expand’ : list-like results will be turned into columns.
’reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
’broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
args (tuple) – Positional arguments to pass to func in addition to the array/series.
**kwargs – Additional keyword arguments to pass as keywords arguments to func.
- Returns:
Result of applying func along the given axis of the DataFrame.
- Return type:
Series or DataFrame
See also
DataFrame.applymap : For elementwise operations.
DataFrame.aggregate : Only perform aggregating type operations.
DataFrame.transform : Only perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B']) >>> df A B 0 4 9 1 4 9 2 4 9
Using a numpy universal function (in this case the same as np.sqrt(df)):
>>> df.apply(np.sqrt) A B 0 2.0 3.0 1 2.0 3.0 2 2.0 3.0
Using a reducing function on either axis
>>> df.apply(np.sum, axis=0) A 12 B 27 dtype: int64
>>> df.apply(np.sum, axis=1) 0 13 1 13 2 13 dtype: int64
Returning a list-like will result in a Series
>>> df.apply(lambda x: [1, 2], axis=1) 0 [1, 2] 1 [1, 2] 2 [1, 2] dtype: object
Passing result_type='expand' will expand list-like results to columns of a Dataframe.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand') 0 1 0 1 2 1 1 2 2 1 2
Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.
>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1) foo bar 0 1 2 1 1 2 2 1 2
Passing result_type='broadcast' will ensure the same shape result, whether list-like or scalar is returned by the function, and broadcast it along the axis. The resulting column names will be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast') A B 0 1 2 1 1 2 2 1 2
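Positional and keyword arguments are forwarded to func via args and **kwargs. A minimal sketch, where subtract_and_scale is a hypothetical helper (not part of the API):
>>> def subtract_and_scale(col, offset, factor=1):
...     # col is one column passed as a Series (axis=0 by default);
...     # offset arrives via ``args``, factor via ``**kwargs``
...     return (col - offset) * factor
>>> df.apply(subtract_and_scale, args=(1,), factor=10)
    A   B
0  30  80
1  30  80
2  30  80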
- applymap(func, na_action=None, **kwargs)[source]
Apply a function to a DataFrame elementwise.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
- Parameters:
func (callable) – Python function, returns a single value from a single value.
na_action ({None, 'ignore'}, default None) –
If ‘ignore’, propagate NaN values, without passing them to func.
New in version 1.2.
**kwargs –
Additional keyword arguments to pass as keyword arguments to func.
New in version 1.3.0.
- Returns:
Transformed DataFrame.
- Return type:
See also
DataFrame.apply : Apply a function along input axis of DataFrame.
Examples
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]]) >>> df 0 1 0 1.000 2.120 1 3.356 4.567
>>> df.applymap(lambda x: len(str(x))) 0 1 0 3 4 1 5 5
Like Series.map, NA values can be ignored:
>>> df_copy = df.copy() >>> df_copy.iloc[0, 0] = pd.NA >>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore') 0 1 0 NaN 4 1 5.0 5
Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.
>>> df.applymap(lambda x: x**2) 0 1 0 1.000000 4.494400 1 11.262736 20.857489
But it’s better to avoid applymap in that case.
>>> df ** 2 0 1 0 1.000000 4.494400 1 11.262736 20.857489
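From version 1.3.0, keyword arguments are forwarded to func as well; a small sketch reusing the df above with the built-in round:
>>> df.applymap(round, ndigits=1)
     0    1
0  1.0  2.1
1  3.4  4.6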
- add(other, axis='columns', level=None, fill_value=None)
Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by a constant with the reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply by a dictionary, by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply by a DataFrame of different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a DataFrame with a MultiIndex, by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
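The fill_value mechanics apply to addition as well; a minimal sketch reusing df and other from above, plus the reverse version radd (equivalent to other + dataframe):
>>> df.add(other, fill_value=0)
           angles  degrees
circle          0    360.0
triangle        6    180.0
rectangle       8    360.0
>>> df.radd(1).equals(1 + df)
True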
- all(axis=0, bool_only=None, skipna=True, **kwargs)
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a Series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, then DataFrame is returned; otherwise, Series is returned.
- Return type:
See also
Series.all : Return True if all elements are True.
DataFrame.any : Return True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all() True >>> pd.Series([True, False]).all() False >>> pd.Series([], dtype="float64").all() True >>> pd.Series([np.nan]).all() True >>> pd.Series([np.nan]).all(skipna=False) True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) >>> df col1 col2 0 True True 1 True False
Default behaviour checks if values in each column all return True.
>>> df.all() col1 True col2 False dtype: bool
Specify axis='columns' to check if values in each row all return True.
>>> df.all(axis='columns') 0 True 1 False dtype: bool
Or axis=None for whether every value is True.
>>> df.all(axis=None) False
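With mixed dtypes, bool_only=True restricts the reduction to boolean columns; a minimal sketch (mixed is an illustrative frame):
>>> mixed = pd.DataFrame({'flag': [True, False], 'count': [1, 2]})
>>> mixed.all(bool_only=True)
flag    False
dtype: bool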
- cummax(axis=None, skipna=True, *args, **kwargs)
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative maximum of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.max : Similar functionality but ignores NaN values.
DataFrame.max : Return the maximum over DataFrame axis.
DataFrame.cummax : Return cumulative maximum over DataFrame axis.
DataFrame.cummin : Return cumulative minimum over DataFrame axis.
DataFrame.cumsum : Return cumulative sum over DataFrame axis.
DataFrame.cumprod : Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cummax(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummax() A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1.
>>> df.cummax(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
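For NaN-free data the result agrees with NumPy's running maximum; a quick sketch (s2 is illustrative):
>>> s2 = pd.Series([2, 5, 3, 7])
>>> s2.cummax().to_numpy()
array([2, 5, 5, 7])
>>> np.maximum.accumulate(s2.to_numpy())
array([2, 5, 5, 7])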
- cummin(axis=None, skipna=True, *args, **kwargs)
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative minimum of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.min : Similar functionality but ignores NaN values.
DataFrame.min : Return the minimum over DataFrame axis.
DataFrame.cummax : Return cumulative maximum over DataFrame axis.
DataFrame.cummin : Return cumulative minimum over DataFrame axis.
DataFrame.cumsum : Return cumulative sum over DataFrame axis.
DataFrame.cumprod : Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cummin(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummin() A B 0 2.0 1.0 1 2.0 NaN 2 1.0 0.0
To iterate over columns and find the minimum in each row, use axis=1.
>>> df.cummin(axis=1) A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
- cumprod(axis=None, skipna=True, *args, **kwargs)
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative product of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.prod : Similar functionality but ignores NaN values.
DataFrame.prod : Return the product over DataFrame axis.
DataFrame.cummax : Return cumulative maximum over DataFrame axis.
DataFrame.cummin : Return cumulative minimum over DataFrame axis.
DataFrame.cumsum : Return cumulative sum over DataFrame axis.
DataFrame.cumprod : Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cumprod(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumprod() A B 0 2.0 1.0 1 6.0 NaN 2 6.0 0.0
To iterate over columns and find the product in each row, use axis=1.
>>> df.cumprod(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
- cumsum(axis=None, skipna=True, *args, **kwargs)
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative sum of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.sum : Similar functionality but ignores NaN values.
DataFrame.sum : Return the sum over DataFrame axis.
DataFrame.cummax : Return cumulative maximum over DataFrame axis.
DataFrame.cummin : Return cumulative minimum over DataFrame axis.
DataFrame.cumsum : Return cumulative sum over DataFrame axis.
DataFrame.cumprod : Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cumsum(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumsum() A B 0 2.0 1.0 1 5.0 NaN 2 6.0 1.0
To iterate over columns and find the sum in each row, use axis=1.
>>> df.cumsum(axis=1) A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
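With the default skipna=True, the last row of the cumulative sum matches DataFrame.sum for the frame above (this would not hold with skipna=False when NA values are present):
>>> df.cumsum().iloc[-1]
A    6.0
B    1.0
Name: 2, dtype: float64
>>> df.sum()
A    6.0
B    1.0
dtype: float64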
- div(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by a constant with the reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply by a dictionary, by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply by a DataFrame of different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a DataFrame with a MultiIndex, by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
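Division by zero yields inf rather than raising; a short sketch that swaps the infinities for NaN afterwards, reusing df from above (res is illustrative):
>>> res = df.rdiv(1)  # 1 / df; the zero in 'angles' becomes inf
>>> res.replace(np.inf, np.nan)
             angles   degrees
circle          NaN  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778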
- divide(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by a constant with the reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply by a dictionary, by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply by a DataFrame of different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a DataFrame with a MultiIndex, by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- eq(other, axis='columns', level=None)
Get Equal to of dataframe and other, element-wise (binary operator eq).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
See also
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
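Because NaN != NaN, eq never reports missing values as equal; isna or DataFrame.equals are the NaN-aware alternatives. A minimal sketch (left is illustrative):
>>> left = pd.DataFrame({'x': [1.0, np.nan]})
>>> left.eq(left)
       x
0   True
1  False
>>> left.equals(left)  # treats NaNs in matching locations as equal
True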
- floordiv(other, axis='columns', level=None, fill_value=None)
Get Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by a constant with the reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply by a dictionary, by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply by a DataFrame of different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a DataFrame with a MultiIndex, by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
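The practical difference from div is the rounding and the resulting dtype; a quick sketch on one column of df:
>>> df['degrees'].div(7)
circle       51.428571
triangle     25.714286
rectangle    51.428571
Name: degrees, dtype: float64
>>> df['degrees'].floordiv(7)
circle       51
triangle     25
rectangle    51
Name: degrees, dtype: int64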
- ge(other, axis='columns', level=None)
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
See also
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- gt(other, axis='columns', level=None)
Get Greater than of dataframe and other, element-wise (binary operator gt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
See also
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate=None)[source]
Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
- Parameters:
other (DataFrame, Series, or a list containing any combination of them) – Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
on (str, list of str, or array-like, optional) – Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'left') –
How to handle the operation of the two objects.
left: use calling frame’s index (or column if on is specified)
right: use other’s index.
outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling’s one.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
New in version 1.2.0.
lsuffix (str, default '') – Suffix to use from left frame’s overlapping columns.
rsuffix (str, default '') – Suffix to use from right frame’s overlapping columns.
sort (bool, default False) – Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
validate (str, optional) –
If specified, checks if join is of specified type.
“one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if join keys are unique in left dataset.
“many_to_one” or “m:1”: check if join keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 1.5.0.
- Returns:
A dataframe containing columns from both the caller and other.
- Return type:
See also
DataFrame.merge : For column(s)-on-column(s) operations.
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.
Support for specifying index levels as the on parameter was added in version 0.23.0.
Examples
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df key A 0 K0 A0 1 K1 A1 2 K2 A2 3 K3 A3 4 K4 A4 5 K5 A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], ... 'B': ['B0', 'B1', 'B2']})
>>> other key B 0 K0 B0 1 K1 B1 2 K2 B2
Join DataFrames using their indexes.
>>> df.join(other, lsuffix='_caller', rsuffix='_other') key_caller A key_other B 0 K0 A0 K0 B0 1 K1 A1 K1 B1 2 K2 A2 K2 B2 3 K3 A3 NaN NaN 4 K4 A4 NaN NaN 5 K5 A5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.
>>> df.set_index('key').join(other.set_index('key')) A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.
>>> df.join(other.set_index('key'), on='key') key A B 0 K0 A0 B0 1 K1 A1 B1 2 K2 A2 B2 3 K3 A3 NaN 4 K4 A4 NaN 5 K5 A5 NaN
Using non-unique key values shows how they are matched.
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df key A 0 K0 A0 1 K1 A1 2 K1 A2 3 K3 A3 4 K0 A4 5 K1 A5
>>> df.join(other.set_index('key'), on='key', validate='m:1') key A B 0 K0 A0 B0 1 K1 A1 B1 2 K1 A2 B1 3 K3 A3 NaN 4 K0 A4 B0 5 K1 A5 B1
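With how='inner' the non-matching keys are dropped instead of filled with NaN; a short sketch using the frames above (note df here is the duplicate-key version):
>>> df.join(other.set_index('key'), on='key', how='inner')
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K1  A2  B1
4  K0  A4  B0
5  K1  A5  B1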
- kurt(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
- kurtosis(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
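kurtosis is an alias of kurt; a small sketch on a Series (the result is rounded for display):
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> round(s.kurt(), 6)
-1.2
>>> s.kurtosis() == s.kurt()
True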
- le(other, axis='columns', level=None)
Get Less than or equal to of dataframe and other, element-wise (binary operator le).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
See also
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- lt(other, axis='columns', level=None)
Get Less than of dataframe and other, element-wise (binary operator lt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
See also
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- max(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the maximum of the values over the requested axis.
If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.max() 8
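For DataFrames, numeric_only=True skips non-numeric columns; a minimal sketch (df here is an illustrative frame):
>>> df = pd.DataFrame({'legs': [4, 2], 'name': ['dog', 'falcon']})
>>> df.max(numeric_only=True)
legs    4
dtype: int64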
- mean(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the mean of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
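A short sketch of the axis behaviour (df here is illustrative):
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df.mean()
a    2.0
b    5.0
dtype: float64
>>> df.mean(axis=1)
0    2.5
1    3.5
2    4.5
dtype: float64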
- median(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the median of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
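Unlike the mean, the median is robust to outliers; a quick sketch (s is illustrative):
>>> s = pd.Series([1, 2, 3, 100])
>>> s.mean()
26.5
>>> s.median()
2.5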
- min(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the minimum of the values over the requested axis.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.min() 0
- mod(other, axis='columns', level=None, fill_value=None)
Get Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by a constant with the reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply by a dictionary, by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply by a DataFrame of different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a DataFrame with a MultiIndex, by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
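A minimal sketch of the operator equivalence (s is illustrative):
>>> s = pd.Series([5, 7, 9])
>>> s.mod(3)
0    2
1    1
2    0
dtype: int64
>>> (s % 3).equals(s.mod(3))
True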
- mul(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.add : Add DataFrames.
DataFrame.sub : Subtract DataFrames.
DataFrame.mul : Multiply DataFrames.
DataFrame.div : Divide DataFrames (float division).
DataFrame.truediv : Divide DataFrames (float division).
DataFrame.floordiv : Divide DataFrames (integer division).
DataFrame.mod : Calculate modulo (remainder after division).
DataFrame.pow : Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
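fill_value substitutes only for a value that is missing on one side; where both inputs are missing at the same label, the result stays missing. A minimal sketch with two small Series (the names a and b are illustrative, not from the docstring):

>>> a = pd.Series([1.0, np.nan, np.nan])
>>> b = pd.Series([np.nan, 2.0, np.nan])
>>> a.mul(b, fill_value=0)  # position 2 is missing on both sides
0    0.0
1    0.0
2    NaN
dtype: float64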
- multiply(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- ne(other, axis='columns', level=None)
Get Not equal to of dataframe and other, element-wise (binary operator ne).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eq – Compare DataFrames for equality elementwise.
DataFrame.ne – Compare DataFrames for inequality elementwise.
DataFrame.le – Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt – Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge – Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt – Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
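As the Notes state, NaN values are considered different even from themselves, so ne reports True at any position that holds NaN. A minimal sketch (df_nan is an illustrative name):

>>> df_nan = pd.DataFrame({'cost': [250.0, np.nan]})
>>> df_nan.ne(df_nan)  # NaN != NaN evaluates to True
    cost
0  False
1   True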
- pow(other, axis='columns', level=None, fill_value=None)
Get Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- prod(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) – Axis for the function to be applied on. For Series this parameter is unused and defaults to 0. For DataFrames, specifying axis=None will apply the aggregation across both axes. New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is 1.

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter.

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
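On a DataFrame, prod reduces column-wise by default (axis=0). A minimal sketch:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
>>> df.prod()  # product down each column
a       6
b    6000
dtype: int64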
- product(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) – Axis for the function to be applied on. For Series this parameter is unused and defaults to 0. For DataFrames, specifying axis=None will apply the aggregation across both axes. New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is 1.

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter.

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
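product is an alias of prod, so either spelling returns the same result; a quick sketch:

>>> s = pd.Series([2, 3, 4])
>>> s.product()  # identical to s.prod()
24
>>> s.prod()
24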
- radd(other, axis='columns', level=None, fill_value=None)
Get Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, add. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
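The reversed method only changes which operand sits on the left; since addition is commutative, radd with a scalar matches the plain operator. A minimal sketch reusing df from the examples above:

>>> df.radd(1).equals(1 + df)  # other + dataframe
True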
- rdiv(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- rfloordiv(other, axis='columns', level=None, fill_value=None)
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, floordiv. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
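Because rfloordiv computes other // dataframe, a scalar passed as other is floor-divided by each element. A minimal sketch:

>>> s = pd.Series([1, 2, 3])
>>> s.rfloordiv(10)  # 10 // 1, 10 // 2, 10 // 3
0    10
1     5
2     3
dtype: int64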
- rmod(other, axis='columns', level=None, fill_value=None)
Get Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mod. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
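Because rmod computes other % dataframe, a scalar passed as other is reduced modulo each element. A minimal sketch:

>>> s = pd.Series([3, 4, 5])
>>> s.rmod(10)  # 10 % 3, 10 % 4, 10 % 5
0    1
1    2
2    0
dtype: int64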
- rmul(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mul. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- rpow(other, axis='columns', level=None, fill_value=None)
Get Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, pow. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
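Because rpow computes other ** dataframe, a scalar passed as other becomes the base and each element the exponent. A minimal sketch:

>>> s = pd.Series([1, 2, 3])
>>> s.rpow(2)  # 2 ** 1, 2 ** 2, 2 ** 3
0    2
1    4
2    8
dtype: int64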
- rsub(other, axis='columns', level=None, fill_value=None)
Get Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, sub. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
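Because rsub computes other - dataframe, a scalar passed as other is the minuend. A minimal sketch:

>>> s = pd.Series([1, 2, 3])
>>> s.rsub(10)  # 10 - 1, 10 - 2, 10 - 3
0    9
1    8
2    7
dtype: int64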
- rtruediv(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
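rdiv above is simply an alias of rtruediv; both compute other / dataframe with float semantics. A minimal sketch:

>>> s = pd.Series([1, 2, 4])
>>> s.rtruediv(1)  # 1 / 1, 1 / 2, 1 / 4
0    1.00
1    0.50
2    0.25
dtype: float64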
- sem(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0), columns (1)}) – For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
Series or scalar
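Examples
The standard error of the mean is the standard deviation divided by the square root of the number of observations, so sem returns a scalar for a Series and a Series for a DataFrame. A minimal sketch (values chosen for easy checking):

>>> s = pd.Series([1, 2, 3])
>>> s.sem()  # std of 1.0 divided by sqrt(3)
0.5773502691896258

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]})
>>> df.sem()
a    0.577350
b    1.154701
dtype: float64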
- skew(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased skew over requested axis.
Normalized by N-1.
- Parameters:
axis ({index (0), columns (1)}) – Axis for the function to be applied on. For Series this parameter is unused and defaults to 0. For DataFrames, specifying axis=None will apply the aggregation across both axes. New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
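Examples
Skewness measures the asymmetry of a distribution; perfectly symmetric data has zero skew. A minimal sketch:

>>> s = pd.Series([1, 2, 3])
>>> s.skew()  # symmetric values, so no skew
0.0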
- std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0), columns (1)}) – For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
Series or scalar
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()
age       18.786076
height     0.237417
dtype: float64

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)
age       16.269219
height     0.205609
dtype: float64
- sub(other, axis='columns', level=None, fill_value=None)
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- subtract(other, axis='columns', level=None, fill_value=None)
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.add – Add DataFrames.
DataFrame.sub – Subtract DataFrames.
DataFrame.mul – Multiply DataFrames.
DataFrame.div – Divide DataFrames (float division).
DataFrame.truediv – Divide DataFrames (float division).
DataFrame.floordiv – Divide DataFrames (integer division).
DataFrame.mod – Calculate modulo (remainder after division).
DataFrame.pow – Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the sum of the values over the requested axis.
This is equivalent to the method
numpy.sum.- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.sum() 14
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([], dtype="float64").sum() # min_count=0 is the default 0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
>>> pd.Series([], dtype="float64").sum(min_count=1) nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum() 0.0
>>> pd.Series([np.nan]).sum(min_count=1) nan
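For a DataFrame, the sum is computed per column by default; axis=1 sums each row instead. A minimal sketch with hypothetical data:
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.sum()
a    3
b    7
dtype: int64
>>> df.sum(axis=1)
0    4
1    6
dtype: int64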
- truediv(other, axis='columns', level=None, fill_value=None)
Get floating division of DataFrame and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
DataFrame
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- var(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0), columns (1)}) – For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
Series or scalar
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
>>> df.var() age 352.916667 height 0.056367 dtype: float64
Alternatively, ddof=0 can be set to normalize by N instead of N-1:
>>> df.var(ddof=0) age 264.687500 height 0.042275 dtype: float64
- merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)[source]
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
- Parameters:
right (DataFrame or named Series) – Object to merge with.
how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
New in version 1.2.0.
on (label or list) – Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on (label or list, or array-like) – Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_on (label or list, or array-like) – Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_index (bool, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index (bool, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index.
sort (bool, default False) – Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
copy (bool, default True) – If False, avoid copy if possible.
indicator (bool or str, default False) – If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.
validate (str, optional) –
If specified, checks if merge is of specified type.
”one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
”one_to_many” or “1:m”: check if merge keys are unique in left dataset.
”many_to_one” or “m:1”: check if merge keys are unique in right dataset.
”many_to_many” or “m:m”: allowed, but does not result in checks.
- Returns:
A DataFrame of the two merged objects.
- Return type:
See also
merge_orderedMerge with optional filling/interpolation.
merge_asofMerge on nearest keys.
DataFrame.joinSimilar method using indices.
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0. Support for merging named Series objects was added in version 0.24.0.
Examples
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [1, 2, 3, 5]}) >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [5, 6, 7, 8]}) >>> df1 lkey value 0 foo 1 1 bar 2 2 baz 3 3 foo 5 >>> df2 rkey value 0 foo 5 1 bar 6 2 baz 7 3 foo 8
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.
>>> df1.merge(df2, left_on='lkey', right_on='rkey') lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 1 foo 8 2 foo 5 foo 5 3 foo 5 foo 8 4 bar 2 bar 6 5 baz 3 baz 7
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', ... suffixes=('_left', '_right')) lkey value_left rkey value_right 0 foo 1 foo 5 1 foo 1 foo 8 2 foo 5 foo 5 3 foo 5 foo 8 4 bar 2 bar 6 5 baz 3 baz 7
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False)) Traceback (most recent call last): ... ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]}) >>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]}) >>> df1 a b 0 foo 1 1 bar 2 >>> df2 a c 0 foo 3 1 baz 4
>>> df1.merge(df2, how='inner', on='a') a b c 0 foo 1 3
>>> df1.merge(df2, how='left', on='a') a b c 0 foo 1 3.0 1 bar 2 NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']}) >>> df2 = pd.DataFrame({'right': [7, 8]}) >>> df1 left 0 foo 1 bar >>> df2 right 0 7 1 8
>>> df1.merge(df2, how='cross') left right 0 foo 7 1 foo 8 2 bar 7 3 bar 8
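The validate keyword can guard against unexpected duplicates in the join keys. A minimal sketch reusing the first df1/df2 above, where 'foo' repeats in both key columns, so a one-to-one check fails (the exact error message may vary by version):
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1.merge(df2, left_on='lkey', right_on='rkey', validate='one_to_one')
Traceback (most recent call last):
...
MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge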
- round(decimals=0, *args, **kwargs)[source]
Round a DataFrame to a variable number of decimal places.
- Parameters:
decimals (int, dict, Series) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
*args – Additional keywords have no effect but might be accepted for compatibility with numpy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.
- Returns:
A DataFrame with the affected columns rounded to the specified number of decimal places.
- Return type:
See also
numpy.aroundRound a numpy array to the given number of decimals.
Series.roundRound a Series to the given number of decimals.
Examples
>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)], ... columns=['dogs', 'cats']) >>> df dogs cats 0 0.21 0.32 1 0.01 0.67 2 0.66 0.03 3 0.21 0.18
By providing an integer each column is rounded to the same number of decimal places
>>> df.round(1) dogs cats 0 0.2 0.3 1 0.0 0.7 2 0.7 0.0 3 0.2 0.2
With a dict, the number of places for specific columns can be specified with the column names as key and the number of decimal places as value
>>> df.round({'dogs': 1, 'cats': 0}) dogs cats 0 0.2 0.0 1 0.0 1.0 2 0.7 0.0 3 0.2 0.0
Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as value
>>> decimals = pd.Series([0, 1], index=['cats', 'dogs']) >>> df.round(decimals) dogs cats 0 0.2 0.0 1 0.0 1.0 2 0.7 0.0 3 0.2 0.0
- corr(method='pearson', min_periods=1, numeric_only=False)[source]
Compute pairwise correlation of columns, excluding NA/null values.
- Parameters:
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
numeric_only (bool, default False) –
Include only float, int or boolean data.
New in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
- Returns:
Correlation matrix.
- Return type:
See also
DataFrame.corrwithCompute pairwise correlation with another DataFrame or Series.
Series.corrCompute the correlation between two Series.
Notes
Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
Examples
>>> def histogram_intersection(a, b): ... v = np.minimum(a, b).sum().round(decimals=1) ... return v >>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)], ... columns=['dogs', 'cats']) >>> df.corr(method=histogram_intersection) dogs cats dogs 1.0 0.3 cats 0.3 1.0
>>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)], ... columns=['dogs', 'cats']) >>> df.corr(min_periods=3) dogs cats dogs 1.0 NaN cats NaN 1.0
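For reference, the default Pearson method on a small hypothetical frame with perfectly anti-correlated columns:
>>> df = pd.DataFrame({'dogs': [1, 2, 3, 4], 'cats': [4, 3, 2, 1]})
>>> df.corr()
      dogs  cats
dogs   1.0  -1.0
cats  -1.0   1.0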
- cov(min_periods=None, ddof=1, numeric_only=False)[source]
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
- Parameters:
min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result.
ddof (int, default 1) –
Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
New in version 1.1.0.
numeric_only (bool, default False) –
Include only float, int or boolean data.
New in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
- Returns:
The covariance matrix of the series of the DataFrame.
- Return type:
See also
Series.covCompute covariance with another Series.
core.window.ewm.ExponentialMovingWindow.covExponential weighted sample covariance.
core.window.expanding.Expanding.covExpanding sample covariance.
core.window.rolling.Rolling.covRolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], ... columns=['dogs', 'cats']) >>> df.cov() dogs cats dogs 0.666667 -1.000000 cats -1.000000 1.666667
>>> np.random.seed(42) >>> df = pd.DataFrame(np.random.randn(1000, 5), ... columns=['a', 'b', 'c', 'd', 'e']) >>> df.cov() a b c d e a 0.998438 -0.020161 0.059277 -0.008943 0.014144 b -0.020161 1.059352 -0.008543 -0.024738 0.009826 c 0.059277 -0.008543 1.010670 -0.001486 -0.000271 d -0.008943 -0.024738 -0.001486 0.921297 -0.013692 e 0.014144 0.009826 -0.000271 -0.013692 0.977795
Minimum number of periods
This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:
>>> np.random.seed(42) >>> df = pd.DataFrame(np.random.randn(20, 3), ... columns=['a', 'b', 'c']) >>> df.loc[df.index[:5], 'a'] = np.nan >>> df.loc[df.index[5:10], 'b'] = np.nan >>> df.cov(min_periods=12) a b c a 0.316741 NaN -0.150812 b NaN 1.248003 0.191417 c -0.150812 0.191417 0.895202
- corrwith(other, axis=0, drop=False, method='pearson', numeric_only=False)[source]
Compute pairwise correlation.
Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
- Parameters:
other (DataFrame, Series) – Object with which to compute correlations.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ to compute row-wise, 1 or ‘columns’ for column-wise.
drop (bool, default False) – Drop missing indices from result.
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays and returning a float.
numeric_only (bool, default False) –
Include only float, int or boolean data.
New in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
- Returns:
Pairwise correlations.
- Return type:
See also
DataFrame.corrCompute pairwise correlation of columns.
Examples
>>> index = ["a", "b", "c", "d", "e"] >>> columns = ["one", "two", "three", "four"] >>> df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=columns) >>> df2 = pd.DataFrame(np.arange(16).reshape(4, 4), index=index[:4], columns=columns) >>> df1.corrwith(df2) one 1.0 two 1.0 three 1.0 four 1.0 dtype: float64
>>> df2.corrwith(df1, axis=1) a 1.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
- count(axis=0, numeric_only=False)[source]
Count non-NA cells for each column or row.
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
numeric_only (bool, default False) – Include only float, int or boolean data.
- Returns:
For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.
- Return type:
See also
Series.countNumber of non-NA elements in a Series.
DataFrame.value_countsCount unique combinations of columns.
DataFrame.shapeNumber of DataFrame rows and columns (including NA elements).
DataFrame.isnaBoolean same-sized DataFrame showing places of NA elements.
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person": ... ["John", "Myla", "Lewis", "John", "Myla"], ... "Age": [24., np.nan, 21., 33, 26], ... "Single": [False, True, True, True, False]}) >>> df Person Age Single 0 John 24.0 False 1 Myla NaN True 2 Lewis 21.0 True 3 John 33.0 True 4 Myla 26.0 False
Notice the uncounted NA values:
>>> df.count() Person 5 Age 4 Single 5 dtype: int64
Counts for each row:
>>> df.count(axis='columns') 0 3 1 2 2 3 3 3 4 3 dtype: int64
- nunique(axis=0, dropna=True)[source]
Count number of distinct elements in specified axis.
Return Series with number of distinct elements. Can ignore NaN values.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
dropna (bool, default True) – Don’t include NaN in the counts.
- Return type:
See also
Series.nuniqueMethod nunique for Series.
DataFrame.countCount non-NA cells for each column or row.
Examples
>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]}) >>> df.nunique() A 3 B 2 dtype: int64
>>> df.nunique(axis=1) 0 1 1 2 2 2 dtype: int64
- idxmin(axis=0, skipna=True, numeric_only=False)[source]
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
numeric_only (bool, default False) –
Include only float, int or boolean data.
New in version 1.5.0.
- Returns:
Indexes of minima along the specified axis.
- Return type:
- Raises:
ValueError – If the row/column is empty.
See also
Series.idxminReturn index of the minimum element.
Notes
This method is the DataFrame version of ndarray.argmin.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48], ... 'co2_emissions': [37.2, 19.66, 1712]}, ... index=['Pork', 'Wheat Products', 'Beef'])
>>> df consumption co2_emissions Pork 10.51 37.20 Wheat Products 103.11 19.66 Beef 55.48 1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin() consumption Pork co2_emissions Wheat Products dtype: object
To return the index for the minimum value in each row, use axis="columns".
>>> df.idxmin(axis="columns") Pork consumption Wheat Products co2_emissions Beef consumption dtype: object
- idxmax(axis=0, skipna=True, numeric_only=False)[source]
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
numeric_only (bool, default False) –
Include only float, int or boolean data.
New in version 1.5.0.
- Returns:
Indexes of maxima along the specified axis.
- Return type:
- Raises:
ValueError – If the row/column is empty.
See also
Series.idxmaxReturn index of the maximum element.
Notes
This method is the DataFrame version of ndarray.argmax.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48], ... 'co2_emissions': [37.2, 19.66, 1712]}, ... index=['Pork', 'Wheat Products', 'Beef'])
>>> df consumption co2_emissions Pork 10.51 37.20 Wheat Products 103.11 19.66 Beef 55.48 1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax() consumption Wheat Products co2_emissions Beef dtype: object
To return the index for the maximum value in each row, use axis="columns".
>>> df.idxmax(axis="columns") Pork co2_emissions Wheat Products consumption Beef co2_emissions dtype: object
- mode(axis=0, numeric_only=False, dropna=True)[source]
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
The axis to iterate over while searching for the mode:
0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.
numeric_only (bool, default False) – If True, only apply to numeric columns.
dropna (bool, default True) – Don’t consider counts of NaN/NaT.
- Returns:
The modes of each column or row.
- Return type:
See also
Series.modeReturn the highest frequency value in a Series.
Series.value_countsReturn the counts of values in a Series.
Examples
>>> df = pd.DataFrame([('bird', 2, 2), ... ('mammal', 4, np.nan), ... ('arthropod', 8, 0), ... ('bird', 2, np.nan)], ... index=('falcon', 'horse', 'spider', 'ostrich'), ... columns=('species', 'legs', 'wings')) >>> df species legs wings falcon bird 2 2.0 horse mammal 4 NaN spider arthropod 8 0.0 ostrich bird 2 NaN
By default, missing values are not considered, and the modes of wings are both 0 and 2. Because the resulting DataFrame has two rows, the second row of species and legs contains NaN.
>>> df.mode() species legs wings 0 bird 2.0 0.0 1 NaN NaN 2.0
Setting dropna=False, NaN values are considered and they can be the mode (like for wings).
>>> df.mode(dropna=False) species legs wings 0 bird 2 NaN
Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True) legs wings 0 2.0 0.0 1 NaN 2.0
To compute the mode over columns and not rows, use the axis parameter:
>>> df.mode(axis='columns', numeric_only=True) 0 1 falcon 2.0 NaN horse 4.0 NaN spider 0.0 8.0 ostrich 2.0 NaN
- quantile(q: float = 0.5, axis: int | Literal['index', 'columns', 'rows'] = 0, numeric_only: bool = False, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') Series[source]
- quantile(q: ExtensionArray | ndarray | Index | Series | Sequence[float], axis: int | Literal['index', 'columns', 'rows'] = 0, numeric_only: bool = False, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') Series | DataFrame
- quantile(q: float | ExtensionArray | ndarray | Index | Series | Sequence[float] = 0.5, axis: int | Literal['index', 'columns', 'rows'] = 0, numeric_only: bool = False, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') Series | DataFrame
Return values at the given quantile over requested axis.
- Parameters:
q (float or array-like, default 0.5 (50% quantile)) – Value between 0 <= q <= 1, the quantile(s) to compute.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
numeric_only (bool, default False) –
Include only float, int or boolean data.
Changed in version 2.0.0: The default value of numeric_only is now False.
interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
method ({'single', 'table'}, default 'single') – Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.
- Returns:
If q is an array, a DataFrame will be returned where the index is q, the columns are the columns of self, and the values are the quantiles.
If q is a float, a Series will be returned where the index is the columns of self and the values are the quantiles.
- Return type:
Series or DataFrame
See also
core.window.rolling.Rolling.quantileRolling quantile.
numpy.percentileNumpy function to compute the percentile.
Examples
>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), ... columns=['a', 'b']) >>> df.quantile(.1) a 1.3 b 3.7 Name: 0.1, dtype: float64 >>> df.quantile([.1, .5]) a b 0.1 1.3 3.7 0.5 2.5 55.0
Specifying method=’table’ will compute the quantile over all columns.
>>> df.quantile(.1, method="table", interpolation="nearest") a 1 b 1 Name: 0.1, dtype: int64 >>> df.quantile([.1, .5], method="table", interpolation="nearest") a b 0.1 1 1 0.5 3 100
Specifying numeric_only=False will also compute the quantile of datetime and timedelta data.
>>> df = pd.DataFrame({'A': [1, 2], ... 'B': [pd.Timestamp('2010'), ... pd.Timestamp('2011')], ... 'C': [pd.Timedelta('1 days'), ... pd.Timedelta('2 days')]}) >>> df.quantile(0.5, numeric_only=False) A 1.5 B 2010-07-02 12:00:00 C 1 days 12:00:00 Name: 0.5, dtype: object
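The interpolation keyword matters when the requested quantile falls between two data points. A small sketch with hypothetical data, where the median of four values lies between the two middle points:
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})
>>> df.quantile(0.5)
a    2.5
Name: 0.5, dtype: float64
>>> df.quantile(0.5, interpolation='lower')
a    2
Name: 0.5, dtype: int64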
- asfreq(freq, method=None, how=None, normalize=False, fill_value=None)[source]
Convert time series to specified frequency.
Returns the original data conformed to a new index with the specified frequency.
If the index of this DataFrame is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.
- Parameters:
freq (DateOffset or str) – Frequency DateOffset or string.
method ({'backfill'/'bfill', 'pad'/'ffill'}, default None) –
Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):
’pad’ / ‘ffill’: propagate last valid observation forward to next valid
’backfill’ / ‘bfill’: use NEXT valid observation to fill.
how ({'start', 'end'}, default end) – For PeriodIndex only (see PeriodIndex.asfreq).
normalize (bool, default False) – Whether to reset output index to midnight.
fill_value (scalar, optional) – Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).
- Returns:
DataFrame object reindexed to the specified frequency.
- Return type:
See also
reindexConform DataFrame to new index with optional filling logic.
Notes
To learn more about the frequency strings, please see this link.
Examples
Start by creating a series with 4 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=4, freq='T') >>> series = pd.Series([0.0, None, 2.0, 3.0], index=index) >>> df = pd.DataFrame({'s': series}) >>> df s 2000-01-01 00:00:00 0.0 2000-01-01 00:01:00 NaN 2000-01-01 00:02:00 2.0 2000-01-01 00:03:00 3.0
Upsample the series into 30 second bins.
>>> df.asfreq(freq='30S') s 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 NaN 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2.0 2000-01-01 00:02:30 NaN 2000-01-01 00:03:00 3.0
Upsample again, providing a fill value.
>>> df.asfreq(freq='30S', fill_value=9.0) s 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 9.0 2000-01-01 00:01:00 NaN 2000-01-01 00:01:30 9.0 2000-01-01 00:02:00 2.0 2000-01-01 00:02:30 9.0 2000-01-01 00:03:00 3.0
Upsample again, providing a method.
>>> df.asfreq(freq='30S', method='bfill') s 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 NaN 2000-01-01 00:01:30 2.0 2000-01-01 00:02:00 2.0 2000-01-01 00:02:30 3.0 2000-01-01 00:03:00 3.0
- resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, on=None, level=None, origin='start_day', offset=None, group_keys=False)[source]
Resample time-series data.
Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.
- Parameters:
rule (DateOffset, Timedelta or str) – The offset string or object representing target conversion.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
closed ({'right', 'left'}, default None) – Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label ({'right', 'left'}, default None) – Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention ({'start', 'end', 's', 'e'}, default 'start') – For PeriodIndex only, controls whether to use the start or end of rule.
kind ({'timestamp', 'period'}, optional, default None) – Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
on (str, optional) – For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
level (str or int, optional) – For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.
origin (Timestamp or str, default 'start_day') –
The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:
’epoch’: origin is 1970-01-01
’start’: origin is the first value of the timeseries
’start_day’: origin is the first day at midnight of the timeseries
New in version 1.1.0.
’end’: origin is the last value of the timeseries
’end_day’: origin is the ceiling midnight of the last day
New in version 1.3.0.
offset (Timedelta or str, default is None) –
An offset timedelta added to the origin.
New in version 1.1.0.
group_keys (bool, default False) –
Whether to include the group keys in the result index when using .apply() on the resampled object.
New in version 1.5.0: Not specifying group_keys will retain values-dependent behavior from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples).
Changed in version 2.0.0: group_keys now defaults to False.
- Returns:
Resampler object.
- Return type:
pandas.core.Resampler
See also
Series.resampleResample a Series.
DataFrame.resampleResample a DataFrame.
groupbyGroup DataFrame by mapping, function, label, or list of labels.
asfreqReindex a DataFrame with the given frequency without grouping.
Notes
See the user guide for more.
To learn more about the offset strings, please see this link.
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T') >>> series = pd.Series(range(9), index=index) >>> series 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01 00:02:00 2 2000-01-01 00:03:00 3 2000-01-01 00:04:00 4 2000-01-01 00:05:00 5 2000-01-01 00:06:00 6 2000-01-01 00:07:00 7 2000-01-01 00:08:00 8 Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum() 2000-01-01 00:00:00 3 2000-01-01 00:03:00 12 2000-01-01 00:06:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket that it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum() 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] # Select first 5 rows 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1.0 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2.0 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the ffill method.
>>> series.resample('30S').ffill()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via apply.
>>> def custom_resampler(arraylike): ... return np.sum(arraylike) + 5 ... >>> series.resample('3T').apply(custom_resampler) 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01', ... freq='A', ... periods=2)) >>> s 2012 1 2013 2 Freq: A-DEC, dtype: int64 >>> s.resample('Q', convention='start').asfreq() 2012Q1 1.0 2012Q2 NaN 2012Q3 NaN 2012Q4 NaN 2013Q1 2.0 2013Q2 NaN 2013Q3 NaN 2013Q4 NaN Freq: Q-DEC, dtype: float64
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01', ... freq='Q', ... periods=4)) >>> q 2018Q1 1 2018Q2 2 2018Q3 3 2018Q4 4 Freq: Q-DEC, dtype: int64 >>> q.resample('M', convention='end').asfreq() 2018-03 1.0 2018-04 NaN 2018-05 NaN 2018-06 2.0 2018-07 NaN 2018-08 NaN 2018-09 3.0 2018-10 NaN 2018-11 NaN 2018-12 4.0 Freq: M, dtype: float64
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.
>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19], ... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]} >>> df = pd.DataFrame(d) >>> df['week_starting'] = pd.date_range('01/01/2018', ... periods=8, ... freq='W') >>> df price volume week_starting 0 10 50 2018-01-07 1 11 60 2018-01-14 2 9 40 2018-01-21 3 13 100 2018-01-28 4 14 50 2018-02-04 5 18 100 2018-02-11 6 17 40 2018-02-18 7 19 50 2018-02-25 >>> df.resample('M', on='week_starting').mean() price volume week_starting 2018-01-31 10.75 62.5 2018-02-28 17.00 60.0
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.
>>> days = pd.date_range('1/1/2000', periods=4, freq='D') >>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19], ... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]} >>> df2 = pd.DataFrame( ... d2, ... index=pd.MultiIndex.from_product( ... [days, ['morning', 'afternoon']] ... ) ... ) >>> df2 price volume 2000-01-01 morning 10 50 afternoon 11 60 2000-01-02 morning 9 40 afternoon 13 100 2000-01-03 morning 14 50 afternoon 18 100 2000-01-04 morning 17 40 afternoon 19 50 >>> df2.resample('D', level=0).sum() price volume 2000-01-01 21 110 2000-01-02 22 140 2000-01-03 32 150 2000-01-04 36 90
If you want to adjust the start of the bins based on a fixed timestamp:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00' >>> rng = pd.date_range(start, end, freq='7min') >>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng) >>> ts 2000-10-01 23:30:00 0 2000-10-01 23:37:00 3 2000-10-01 23:44:00 6 2000-10-01 23:51:00 9 2000-10-01 23:58:00 12 2000-10-02 00:05:00 15 2000-10-02 00:12:00 18 2000-10-02 00:19:00 21 2000-10-02 00:26:00 24 Freq: 7T, dtype: int64
>>> ts.resample('17min').sum() 2000-10-01 23:14:00 0 2000-10-01 23:31:00 9 2000-10-01 23:48:00 21 2000-10-02 00:05:00 54 2000-10-02 00:22:00 24 Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum() 2000-10-01 23:18:00 0 2000-10-01 23:35:00 18 2000-10-01 23:52:00 27 2000-10-02 00:09:00 39 2000-10-02 00:26:00 24 Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum() 2000-10-01 23:24:00 3 2000-10-01 23:41:00 15 2000-10-01 23:58:00 45 2000-10-02 00:15:00 45 Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
>>> ts.resample('17min', origin='start').sum() 2000-10-01 23:30:00 9 2000-10-01 23:47:00 21 2000-10-02 00:04:00 54 2000-10-02 00:21:00 24 Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum() 2000-10-01 23:30:00 9 2000-10-01 23:47:00 21 2000-10-02 00:04:00 54 2000-10-02 00:21:00 24 Freq: 17T, dtype: int64
If you want to take the largest Timestamp as the end of the bins:
>>> ts.resample('17min', origin='end').sum() 2000-10-01 23:35:00 0 2000-10-01 23:52:00 18 2000-10-02 00:09:00 27 2000-10-02 00:26:00 63 Freq: 17T, dtype: int64
In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:
>>> ts.resample('17min', origin='end_day').sum() 2000-10-01 23:38:00 3 2000-10-01 23:55:00 15 2000-10-02 00:12:00 45 2000-10-02 00:29:00 45 Freq: 17T, dtype: int64
- to_timestamp(freq=None, how='start', axis=0, copy=None)[source]
Cast to DatetimeIndex of timestamps, at beginning of period.
- Parameters:
freq (str, default frequency of PeriodIndex) – Desired frequency.
how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).
copy (bool, default True) – If False then underlying input data is not copied.
- Returns:
The DataFrame has a DatetimeIndex.
- Return type:
Examples
>>> idx = pd.PeriodIndex(['2023', '2024'], freq='Y') >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df1 = pd.DataFrame(data=d, index=idx) >>> df1 col1 col2 2023 1 3 2024 2 4
The resulting timestamps will be at the beginning of the year in this case:
>>> df1 = df1.to_timestamp() >>> df1 col1 col2 2023-01-01 1 3 2024-01-01 2 4 >>> df1.index DatetimeIndex(['2023-01-01', '2024-01-01'], dtype='datetime64[ns]', freq=None)
Using freq, which is the offset that the Timestamps will have:
>>> df2 = pd.DataFrame(data=d, index=idx) >>> df2 = df2.to_timestamp(freq='M') >>> df2 col1 col2 2023-01-31 1 3 2024-01-31 2 4 >>> df2.index DatetimeIndex(['2023-01-31', '2024-01-31'], dtype='datetime64[ns]', freq=None)
- to_period(freq=None, axis=0, copy=None)[source]
Convert DataFrame from DatetimeIndex to PeriodIndex.
Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency (inferred from index if not passed).
- Parameters:
- Returns:
The DataFrame has a PeriodIndex.
- Return type:
Examples
>>> idx = pd.to_datetime( ... [ ... "2001-03-31 00:00:00", ... "2002-05-31 00:00:00", ... "2003-08-31 00:00:00", ... ] ... )
>>> idx DatetimeIndex(['2001-03-31', '2002-05-31', '2003-08-31'], dtype='datetime64[ns]', freq=None)
>>> idx.to_period("M") PeriodIndex(['2001-03', '2002-05', '2003-08'], dtype='period[M]')
For the yearly frequency
>>> idx.to_period("Y") PeriodIndex(['2001', '2002', '2003'], dtype='period[A-DEC]')
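The same conversion applies at the DataFrame level, where to_period transforms the frame’s DatetimeIndex. A minimal sketch using the idx above:
>>> df = pd.DataFrame({"col": [1, 2, 3]}, index=idx)
>>> df.to_period("M").index
PeriodIndex(['2001-03', '2002-05', '2003-08'], dtype='period[M]')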
- isin(values)[source]
Whether each element in the DataFrame is contained in values.
- Parameters:
values (iterable, Series, DataFrame or dict) – The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
- Returns:
DataFrame of booleans showing whether each element in the DataFrame is contained in values.
- Return type:
See also
DataFrame.eqEquality test for DataFrame.
Series.isinEquivalent method on Series.
Series.str.containsTest if pattern or regex is contained within a string of a Series or Index.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]}, ... index=['falcon', 'dog']) >>> df num_legs num_wings falcon 2 2 dog 4 0
When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings).
>>> df.isin([0, 2]) num_legs num_wings falcon True True dog False True
To check if values is not in the DataFrame, use the ~ operator:
>>> ~df.isin([0, 2]) num_legs num_wings falcon False False dog True False
When values is a dict, we can pass values to check for each column separately:
>>> df.isin({'num_wings': [0, 3]}) num_legs num_wings falcon False False dog False True
When values is a Series or DataFrame, the index and column must match. Note that ‘falcon’ does not match based on the number of legs in other.
>>> other = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]}, ... index=['spider', 'falcon']) >>> df.isin(other) num_legs num_wings falcon False True dog False False
- index
The index (row labels) of the DataFrame.
- columns
The column labels of the DataFrame.
- plot
alias of
PlotAccessor
- hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)
Make a histogram of the DataFrame’s columns.
A histogram is a representation of the distribution of data. This function calls matplotlib.pyplot.hist() on each series in the DataFrame, resulting in one histogram per column.
- Parameters:
data (DataFrame) – The pandas object holding the data.
column (str or sequence, optional) – If passed, will be used to limit data to a subset of columns.
by (object, optional) – If passed, then used to form histograms for separate groups.
grid (bool, default True) – Whether to show axis grid lines.
xlabelsize (int, default None) – If specified changes the x-axis label size.
xrot (float, default None) – Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.
ylabelsize (int, default None) – If specified changes the y-axis label size.
yrot (float, default None) – Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.
ax (Matplotlib axes object, default None) – The axes to plot the histogram on.
sharex (bool, default True if ax is None else False) – In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.
sharey (bool, default False) – In case subplots=True, share y axis and set some y axis labels to invisible.
figsize (tuple, optional) – The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.
layout (tuple, optional) – Tuple of (rows, columns) for the layout of the histograms.
bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.
backend (str, default None) – Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
legend (bool, default False) –
Whether to show the legend.
New in version 1.1.0.
**kwargs – All other plotting keyword arguments to be passed to matplotlib.pyplot.hist().
- Return type:
matplotlib.AxesSubplot or numpy.ndarray of them
See also
matplotlib.pyplot.histPlot a histogram using matplotlib.
Examples
This example draws a histogram based on the length and width of some animals, displayed in three bins:
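A minimal sketch of such a call (the rendered figure is omitted here; the data are illustrative):
>>> df = pd.DataFrame({
...     'length': [1.5, 0.5, 1.2, 0.9, 3],
...     'width': [0.7, 0.2, 0.15, 0.2, 1.1]
... }, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.hist(bins=3)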
- boxplot(column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, backend=None, **kwargs)
Make a box plot from DataFrame columns.
Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns. A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of box to show the range of the data. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots.
For further details see Wikipedia’s entry for boxplot.
- Parameters:
column (str or list of str, optional) – Column name or list of names, or vector. Can be any valid input to pandas.DataFrame.groupby().
by (str or array-like, optional) – Column in the DataFrame to pandas.DataFrame.groupby(). One box-plot will be done per value of columns in by.
ax (object of class matplotlib.axes.Axes, optional) – The matplotlib axes to be used by boxplot.
fontsize (float or str) – Tick label font size in points or as a string (e.g., large).
rot (float, default 0) – The rotation angle of labels (in degrees) with respect to the screen coordinate system.
grid (bool, default True) – Setting this to True will show the grid.
figsize (A tuple (width, height) in inches) – The size of the figure to create in matplotlib.
layout (tuple (rows, columns), optional) – For example, (3, 5) will display the subplots using 3 rows and 5 columns, starting from the top-left.
return_type ({'axes', 'dict', 'both'} or None, default 'axes') –
The kind of object to return. The default is axes.
’axes’ returns the matplotlib axes the boxplot is drawn on.
’dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot.
’both’ returns a namedtuple with the axes and dict.
When grouping with by, a Series mapping columns to return_type is returned.
If return_type is None, a NumPy array of axes with the same shape as layout is returned.
backend (str, default None) – Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
**kwargs – All other plotting keyword arguments to be passed to matplotlib.pyplot.boxplot().
- Returns:
See Notes.
- Return type:
result
See also
pandas.Series.plot.histMake a histogram.
matplotlib.pyplot.boxplotMatplotlib equivalent plot.
Notes
The return type depends on the return_type parameter:
‘axes’ : object of class matplotlib.axes.Axes
‘dict’ : dict of matplotlib.lines.Line2D objects
‘both’ : a namedtuple with structure (ax, lines)
For data grouped with by, return a Series of the above or a numpy array of them (for return_type = None).
Use return_type='dict' when you want to tweak the appearance of the lines after plotting. In this case a dict containing the Lines making up the boxes, caps, fliers, medians, and whiskers is returned.
Examples
Boxplots can be created for every column in the dataframe by df.boxplot() or by indicating the columns to be used.
Boxplots of variables distributions grouped by the values of a third variable can be created using the option by.
A list of strings (i.e. ['X', 'Y']) can be passed to boxplot in order to group the data by combination of the variables in the x-axis.
The layout of boxplot can be adjusted giving a tuple to layout.
Additional formatting can be done to the boxplot, like suppressing the grid (grid=False), rotating the labels in the x-axis (i.e. rot=45) or changing the fontsize (i.e. fontsize=15). A sketch of these calls appears below.
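A minimal sketch of the calls described above, assuming a small random frame with columns Col1–Col4 and a grouping column X (the rendered figures are omitted):
>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.randn(10, 4),
...                   columns=['Col1', 'Col2', 'Col3', 'Col4'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])
>>> df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
...                      'B', 'B', 'B', 'B', 'B'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X',
...                      layout=(2, 1), rot=45, fontsize=12, grid=False)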
The parameter return_type can be used to select the type of element returned by boxplot. When return_type='axes' is selected, the matplotlib axes on which the boxplot is drawn are returned:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], return_type='axes') >>> type(boxplot) <class 'matplotlib.axes._subplots.AxesSubplot'>
When grouping with by, a Series mapping columns to return_type is returned:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X', ... return_type='axes') >>> type(boxplot) <class 'pandas.core.series.Series'>
If return_type is None, a NumPy array of axes with the same shape as layout is returned:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X', ... return_type=None) >>> type(boxplot) <class 'numpy.ndarray'>
- sparse
alias of
SparseFrameAccessor
- property values: ndarray
Return a Numpy representation of the DataFrame.
Warning
We recommend using DataFrame.to_numpy() instead.
Only the values in the DataFrame will be returned; the axes labels will be removed.
- Returns:
The values of the DataFrame.
- Return type:
numpy.ndarray
See also
DataFrame.to_numpyRecommended alternative to this method.
DataFrame.indexRetrieve the index labels.
DataFrame.columnsRetrieving the column names.
Notes
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.
Examples
A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
>>> df = pd.DataFrame({'age': [ 3, 29], ... 'height': [94, 170], ... 'weight': [31, 115]}) >>> df age height weight 0 3 94 31 1 29 170 115 >>> df.dtypes age int64 height int64 weight int64 dtype: object >>> df.values array([[ 3, 94, 31], [ 29, 170, 115]])
A DataFrame with mixed type columns (e.g., str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types (e.g., object).
>>> df2 = pd.DataFrame([('parrot', 24.0, 'second'), ... ('lion', 80.5, 1), ... ('monkey', np.nan, None)], ... columns=('name', 'max_speed', 'rank')) >>> df2.dtypes name object max_speed float64 rank object dtype: object >>> df2.values array([['parrot', 24.0, 'second'], ['lion', 80.5, 1], ['monkey', nan, None]], dtype=object)
- ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast: dict | None = None) DataFrame[source]
- ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast: dict | None = None) None
- ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast: dict | None = None) DataFrame | None
Synonym for DataFrame.fillna() with method='ffill'.
- Returns:
Object with missing values filled or None if inplace=True.
- Return type:
Series/DataFrame or None
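Examples
A minimal forward-fill sketch (this frame is illustrative, not from the original docstring):
>>> df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})
>>> df.ffill()
     A    B
0  1.0  NaN
1  1.0  5.0
2  3.0  5.0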
- bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast=None) DataFrame[source]
- bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast=None) None
- bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast=None) DataFrame | None
Synonym for
DataFrame.fillna() with method='bfill'.
- Returns:
Object with missing values filled or None if inplace=True.
- Return type:
Series/DataFrame or None
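Examples
A matching backward-fill sketch (same illustrative frame as in ffill above):
>>> df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})
>>> df.bfill()
     A    B
0  1.0  5.0
1  3.0  5.0
2  3.0  NaN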
- clip(lower=None, upper=None, *, axis=None, inplace=False, **kwargs)[source]
Trim values at input threshold(s).
Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
- Parameters:
lower (float or array-like, default None) – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
upper (float or array-like, default None) – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.
inplace (bool, default False) – Whether to perform the operation in place on the data.
*args – Additional positional arguments have no effect but might be accepted for compatibility with numpy.
**kwargs – Additional keyword arguments have no effect but might be accepted for compatibility with numpy.
- Returns:
Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.
- Return type:
Series or DataFrame or None
See also
Series.clip: Trim values at input threshold in series.
DataFrame.clip: Trim values at input threshold in dataframe.
numpy.clip: Clip (limit) the values in an array.
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]} >>> df = pd.DataFrame(data) >>> df col_0 col_1 0 9 -2 1 -3 -7 2 0 6 3 -1 8 4 5 -5
Clips per column using lower and upper thresholds:
>>> df.clip(-4, 6) col_0 col_1 0 6 -2 1 -3 -4 2 0 6 3 -1 6 4 5 -4
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3]) >>> t 0 2 1 -4 2 -1 3 6 4 3 dtype: int64
>>> df.clip(t, t + 4, axis=0) col_0 col_1 0 6 2 1 -3 -4 2 0 3 3 6 8 4 5 3
Clips using specific lower threshold per column element, with missing values:
>>> t = pd.Series([2, -4, np.NaN, 6, 3]) >>> t 0 2.0 1 -4.0 2 NaN 3 6.0 4 3.0 dtype: float64
>>> df.clip(t, axis=0) col_0 col_1 0 9 2 1 -3 -4 2 0 6 3 6 8 4 5 3
- interpolate(method='linear', *, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)[source]
Fill NaN values using an interpolation method.
Please note that only
method='linear'is supported for DataFrame/Series with a MultiIndex.- Parameters:
method (str, default 'linear') –
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: Use the actual numerical values of the index.
‘pad’: Fill in NaNs using existing values.
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d, whereas ‘spline’ is passed to scipy.interpolate.UnivariateSpline. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5). Note that the ‘slinear’ method in pandas refers to the SciPy first-order spline, not the pandas first-order spline.
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the ‘piecewise_polynomial’ interpolation method in scipy 0.18.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Axis to interpolate along. For Series this parameter is unused and defaults to 0.
limit (int, optional) – Maximum number of consecutive NaNs to fill. Must be greater than 0.
inplace (bool, default False) – Update the data in place if possible.
limit_direction ({'forward', 'backward', 'both'}, optional) –
Consecutive NaNs will be filled in this direction.
- If limit is specified:
If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.
If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backward’.
- If ‘limit’ is not specified:
If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’;
otherwise the default is ‘forward’.
Changed in version 1.1.0: Raises ValueError if limit_direction is ‘forward’ or ‘both’ and method is ‘backfill’ or ‘bfill’; raises ValueError if limit_direction is ‘backward’ or ‘both’ and method is ‘pad’ or ‘ffill’.
limit_area ({None, ‘inside’, ‘outside’}, default None) –
If limit is specified, consecutive NaNs will be filled with this restriction.
None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values (interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).
downcast (optional, 'infer' or None, defaults to None) – Downcast dtypes if possible.
**kwargs (optional) – Keyword arguments to pass on to the interpolating function.
- Returns:
Returns the same object type as the caller, interpolated at some or all NaN values, or None if inplace=True.
- Return type:
Series or DataFrame or None
See also
fillna: Fill missing values using different methods.
scipy.interpolate.Akima1DInterpolator: Piecewise cubic polynomials (Akima interpolator).
scipy.interpolate.BPoly.from_derivatives: Piecewise polynomial in the Bernstein basis.
scipy.interpolate.interp1d: Interpolate a 1-D function.
scipy.interpolate.KroghInterpolator: Interpolate polynomial (Krogh interpolator).
scipy.interpolate.PchipInterpolator: PCHIP 1-d monotonic cubic interpolation.
scipy.interpolate.CubicSpline: Cubic spline data interpolator.
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation.
Examples
Filling in
NaN in a Series via linear interpolation.
>>> s = pd.Series([0, 1, np.nan, 3]) >>> s 0 0.0 1 1.0 2 NaN 3 3.0 dtype: float64 >>> s.interpolate() 0 0.0 1 1.0 2 2.0 3 3.0 dtype: float64
Filling in
NaN in a Series by padding, but filling at most two consecutive NaN at a time.
>>> s = pd.Series([np.nan, "single_one", np.nan, ... "fill_two_more", np.nan, np.nan, np.nan, ... 4.71, np.nan]) >>> s 0 NaN 1 single_one 2 NaN 3 fill_two_more 4 NaN 5 NaN 6 NaN 7 4.71 8 NaN dtype: object >>> s.interpolate(method='pad', limit=2) 0 NaN 1 single_one 2 single_one 3 fill_two_more 4 fill_two_more 5 fill_two_more 6 NaN 7 4.71 8 4.71 dtype: object
Filling in
NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods require that you also specify an order (int).
>>> s = pd.Series([0, 2, np.nan, 8]) >>> s.interpolate(method='polynomial', order=2) 0 0.000000 1 2.000000 2 4.666667 3 8.000000 dtype: float64
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains
NaN, because there is no entry before it to use for interpolation.
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0), ... (np.nan, 2.0, np.nan, np.nan), ... (2.0, 3.0, np.nan, 9.0), ... (np.nan, 4.0, -4.0, 16.0)], ... columns=list('abcd')) >>> df a b c d 0 0.0 NaN -1.0 1.0 1 NaN 2.0 NaN NaN 2 2.0 3.0 NaN 9.0 3 NaN 4.0 -4.0 16.0 >>> df.interpolate(method='linear', limit_direction='forward', axis=0) a b c d 0 0.0 NaN -1.0 1.0 1 1.0 2.0 -2.0 5.0 2 2.0 3.0 -3.0 9.0 3 2.0 4.0 -4.0 16.0
Using polynomial interpolation.
>>> df['d'].interpolate(method='polynomial', order=2) 0 1.0 1 4.0 2 9.0 3 16.0 Name: d, dtype: float64
- where(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame[source]
- where(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) None
- where(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame | None
Replace values where the condition is False.
- Parameters:
cond (bool Series/DataFrame, array-like, or callable) – Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
other (scalar, Series/DataFrame, or callable) – Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (
np.nan for numpy dtypes, pd.NA for extension dtypes).
inplace (bool, default False) – Whether to perform the operation in place on the data.
axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.
level (int, default None) – Alignment level if needed.
- Return type:
Same type as caller or None if
inplace=True.
See also
DataFrame.mask(): Return an object of same shape as self.
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with the axis of the cond Series/DataFrame, the misaligned index positions will be filled with False.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
The dtype of the object takes precedence. The fill value is cast to the object’s dtype if this can be done losslessly.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64 >>> s.mask(s > 0) 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
>>> s = pd.Series(range(5)) >>> t = pd.Series([True, False]) >>> s.where(t, 99) 0 0 1 99 2 99 3 99 4 99 dtype: int64 >>> s.mask(t, 99) 0 99 1 1 2 99 3 99 4 99 dtype: int64
>>> s.where(s > 1, 10) 0 10 1 10 2 2 3 3 4 4 dtype: int64 >>> s.mask(s > 1, 10) 0 0 1 1 2 10 3 10 4 10 dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> df A B 0 0 1 1 2 3 2 4 5 3 6 7 4 8 9 >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
- mask(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame[source]
- mask(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) None
- mask(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame | None
Replace values where the condition is True.
- Parameters:
cond (bool Series/DataFrame, array-like, or callable) – Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
other (scalar, Series/DataFrame, or callable) – Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (
np.nan for numpy dtypes, pd.NA for extension dtypes).
inplace (bool, default False) – Whether to perform the operation in place on the data.
axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.
level (int, default None) – Alignment level if needed.
- Return type:
Same type as caller or None if
inplace=True.
See also
DataFrame.where(): Return an object of same shape as self.
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with the axis of the cond Series/DataFrame, the misaligned index positions will be filled with True.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
The dtype of the object takes precedence. The fill value is cast to the object’s dtype if this can be done losslessly.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64 >>> s.mask(s > 0) 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
>>> s = pd.Series(range(5)) >>> t = pd.Series([True, False]) >>> s.where(t, 99) 0 0 1 99 2 99 3 99 4 99 dtype: int64 >>> s.mask(t, 99) 0 99 1 1 2 99 3 99 4 99 dtype: int64
>>> s.where(s > 1, 10) 0 10 1 10 2 2 3 3 4 4 dtype: int64 >>> s.mask(s > 1, 10) 0 0 1 1 2 10 3 10 4 10 dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> df A B 0 0 1 1 2 3 2 4 5 3 6 7 4 8 9 >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
- class pandas.DateOffset
Standard kind of date increment used for a date range.
Works exactly like the keyword argument form of relativedelta. Note that the positional argument form of relativedelta is not supported. Use of the keyword n is discouraged; you are better off specifying n within the keywords you use, but regardless it is there for you. n is needed for DateOffset subclasses.
DateOffset works as follows. Each offset specifies a set of dates that conform to the DateOffset. For example, Bday defines this set to be the set of dates that are weekdays (M-F). To test whether a date is in the set of a DateOffset dateOffset, we can use the is_on_offset method: dateOffset.is_on_offset(date).
If a date is not itself a valid date for the offset, the rollback and rollforward methods can be used to roll it to the nearest valid date before/after it.
DateOffsets can be created to move dates forward a given number of valid dates. For example, Bday(2) can be added to a date to move it two business days forward. If the date does not start on a valid date, first it is moved to a valid date. Thus pseudo code is:
def __add__(date): date = rollback(date) # does nothing if date is valid return date + <n number of periods>
When a date offset is created for a negative number of periods, the date is first rolled forward. The pseudo code is:
def __add__(date): date = rollforward(date) # does nothing if date is valid return date + <n number of periods>
Zero presents a problem. Should it roll forward or back? We arbitrarily have it roll forward:
date + BDay(0) == BDay.rollforward(date)
Since 0 is a bit weird, we suggest avoiding its use.
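For instance, an illustrative sketch (2017-01-07 is a Saturday, so adding BDay(0) rolls it forward to the next business day):
>>> from pandas.tseries.offsets import BDay
>>> pd.Timestamp('2017-01-07') + BDay(0)
Timestamp('2017-01-09 00:00:00')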
In addition, adding a DateOffset specified with the singular form of a date component (e.g. day rather than days) can be used to replace that component of the timestamp.
- Parameters:
n (int, default 1) – The number of time periods the offset represents. If specified without a temporal pattern, defaults to n days.
normalize (bool, default False) – Whether to round the result of a DateOffset addition down to the previous midnight.
**kwds –
Temporal parameters that add to or replace the offset value.
Parameters that add to the offset (like Timedelta):
years
months
weeks
days
hours
minutes
seconds
milliseconds
microseconds
nanoseconds
Parameters that replace the offset value:
year
month
day
weekday
hour
minute
second
microsecond
nanosecond
See also
dateutil.relativedelta.relativedelta: The relativedelta type is designed to be applied to an existing datetime and can replace specific components of that datetime, or represent an interval of time.
Examples
>>> from pandas.tseries.offsets import DateOffset >>> ts = pd.Timestamp('2017-01-01 09:10:11') >>> ts + DateOffset(months=3) Timestamp('2017-04-01 09:10:11')
>>> ts = pd.Timestamp('2017-01-01 09:10:11') >>> ts + DateOffset(months=2) Timestamp('2017-03-01 09:10:11') >>> ts + DateOffset(day=31) Timestamp('2017-01-31 09:10:11')
>>> ts + pd.DateOffset(hour=8) Timestamp('2017-01-01 08:10:11')
- class pandas.DatetimeIndex[source]
Immutable ndarray-like of datetime64 data.
Represented internally as int64, which can be boxed to Timestamp objects that are subclasses of datetime and carry metadata.
Changed in version 2.0.0: The various numeric date/time attributes (
day, month, year etc.) now have dtype int32. Previously they had dtype int64.
- Parameters:
data (array-like (1-dimensional)) – Datetime-like data to construct index with.
freq (str or pandas offset object, optional) – One of pandas date offset strings or corresponding objects. The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon creation.
tz (pytz.timezone or dateutil.tz.tzfile or datetime.tzinfo or str) – Set the Timezone of the data.
normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.
closed ({'left', 'right'}, optional) – Set whether to include start and end that are on the boundary. The default includes boundary points on either end.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
‘infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
dayfirst (bool, default False) – If True, parse dates in data with the day first order.
yearfirst (bool, default False) – If True, parse dates in data with the year first order.
dtype (numpy.dtype or DatetimeTZDtype or str, default None) – Note that the only NumPy dtype allowed is ‘datetime64[ns]’.
copy (bool, default False) – Make a copy of input ndarray.
name (label, default None) – Name to be stored in the index.
- Return type:
- year
- month
- day
- hour
- minute
- second
- microsecond
- nanosecond
- date
- time
- timetz
- dayofyear
- day_of_year
- weekofyear
- week
- dayofweek
- day_of_week
- weekday
- quarter
- tz
- Type:
dt.tzinfo | None
- freq
- freqstr
- is_month_start
- is_month_end
- is_quarter_start
- is_quarter_end
- is_year_start
- is_year_end
- is_leap_year
- inferred_freq
- normalize()
- tz_localize()[source]
- Parameters:
ambiguous (TimeAmbiguous) –
nonexistent (TimeNonexistent) –
- Return type:
- round()
- floor()
- ceil()
- to_period()
- to_pydatetime()
- to_series()
- to_frame()
- month_name()
- day_name()
- mean()
- std()
See also
Index: The base pandas Index type.
TimedeltaIndex: Index of timedelta64 data.
PeriodIndex: Index of Period data.
to_datetime: Convert argument to datetime.
date_range: Create a fixed-frequency DatetimeIndex.
Notes
To learn more about the frequency strings, please see this link.
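Examples
A minimal construction sketch (not from the original docstring; frequency inference is assumed to succeed on the evenly spaced input):
>>> pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-03"], freq="infer")
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')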
- property tz
Return the timezone.
- Returns:
Returns None when the array is tz-naive.
- Return type:
datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None
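Examples
A short illustration (the repr shown assumes pytz supplies the timezone object, as in default builds):
>>> idx = pd.date_range('2020-01-01', periods=2, tz='UTC')
>>> idx.tz
<UTC>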
- strftime(date_format)[source]
Convert to Index using specified date_format.
Return an Index of formatted strings specified by date_format, which supports the same string format as the python standard library. Details of the string format can be found in python string format doc.
Formats supported by the C strftime API but not by the python string format doc (such as “%R”, “%r”) are not officially supported and should preferably be replaced with their supported equivalents (such as “%H:%M”, “%I:%M:%S %p”).
Note that PeriodIndex supports additional directives, detailed in Period.strftime.
- Parameters:
date_format (str) – Date format string (e.g. “%Y-%m-%d”).
- Returns:
NumPy ndarray of formatted strings.
- Return type:
ndarray[object]
See also
to_datetime: Convert the given argument to datetime.
DatetimeIndex.normalize: Return DatetimeIndex with times to midnight.
DatetimeIndex.round: Round the DatetimeIndex to the specified freq.
DatetimeIndex.floor: Floor the DatetimeIndex to the specified freq.
Timestamp.strftime: Format a single Timestamp.
Period.strftime: Format a single Period.
Examples
>>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"), ... periods=3, freq='s') >>> rng.strftime('%B %d, %Y, %r') Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM', 'March 10, 2018, 09:00:02 AM'], dtype='object')
- tz_convert(tz)[source]
Convert tz-aware Datetime Array/Index from one time zone to another.
- Parameters:
tz (str, pytz.timezone, dateutil.tz.tzfile, datetime.tzinfo or None) – Time zone for time. Corresponding timestamps would be converted to this time zone of the Datetime Array/Index. A tz of None will convert to UTC and remove the timezone information.
- Return type:
Array or Index
- Raises:
TypeError – If Datetime Array/Index is tz-naive.
See also
DatetimeIndex.tz: A timezone that has a variable offset from UTC.
DatetimeIndex.tz_localize: Localize tz-naive DatetimeIndex to a given time zone, or remove timezone from a tz-aware DatetimeIndex.
Examples
With the tz parameter, we can change the DatetimeIndex to other time zones:
>>> dti = pd.date_range(start='2014-08-01 09:00', ... freq='H', periods=3, tz='Europe/Berlin')
>>> dti DatetimeIndex(['2014-08-01 09:00:00+02:00', '2014-08-01 10:00:00+02:00', '2014-08-01 11:00:00+02:00'], dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert('US/Central') DatetimeIndex(['2014-08-01 02:00:00-05:00', '2014-08-01 03:00:00-05:00', '2014-08-01 04:00:00-05:00'], dtype='datetime64[ns, US/Central]', freq='H')
With tz=None, we can remove the timezone (after converting to UTC if necessary):
>>> dti = pd.date_range(start='2014-08-01 09:00', freq='H', ... periods=3, tz='Europe/Berlin')
>>> dti DatetimeIndex(['2014-08-01 09:00:00+02:00', '2014-08-01 10:00:00+02:00', '2014-08-01 11:00:00+02:00'], dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert(None) DatetimeIndex(['2014-08-01 07:00:00', '2014-08-01 08:00:00', '2014-08-01 09:00:00'], dtype='datetime64[ns]', freq='H')
- tz_localize(tz, ambiguous='raise', nonexistent='raise')[source]
Localize tz-naive Datetime Array/Index to tz-aware Datetime Array/Index.
This method takes a time zone (tz) naive Datetime Array/Index object and makes this time zone aware. It does not move the time to another time zone.
This method can also be used to do the inverse – to create a time zone unaware object from an aware object. To that end, pass tz=None.
- Parameters:
tz (str, pytz.timezone, dateutil.tz.tzfile, datetime.tzinfo or None) – Time zone to convert timestamps to. Passing
None will remove the time zone information, preserving local time.
ambiguous ('infer', 'NaT', bool array, default 'raise') –
When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
‘infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
‘shift_forward’ will shift the nonexistent time forward to the closest existing time
‘shift_backward’ will shift the nonexistent time backward to the closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Array/Index converted to the specified time zone.
- Return type:
Same type as self
- Raises:
TypeError – If the Datetime Array/Index is tz-aware and tz is not None.
See also
DatetimeIndex.tz_convert: Convert tz-aware DatetimeIndex from one time zone to another.
Examples
>>> tz_naive = pd.date_range('2018-03-01 09:00', periods=3) >>> tz_naive DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00', '2018-03-03 09:00:00'], dtype='datetime64[ns]', freq='D')
Localize DatetimeIndex in US/Eastern time zone:
>>> tz_aware = tz_naive.tz_localize(tz='US/Eastern') >>> tz_aware DatetimeIndex(['2018-03-01 09:00:00-05:00', '2018-03-02 09:00:00-05:00', '2018-03-03 09:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
With tz=None, we can remove the time zone information while keeping the local time (not converted to UTC):
>>> tz_aware.tz_localize(None) DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00', '2018-03-03 09:00:00'], dtype='datetime64[ns]', freq=None)
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:30:00', ... '2018-10-28 02:00:00', ... '2018-10-28 02:30:00', ... '2018-10-28 02:00:00', ... '2018-10-28 02:30:00', ... '2018-10-28 03:00:00', ... '2018-10-28 03:30:00'])) >>> s.dt.tz_localize('CET', ambiguous='infer') 0 2018-10-28 01:30:00+02:00 1 2018-10-28 02:00:00+02:00 2 2018-10-28 02:30:00+02:00 3 2018-10-28 02:00:00+01:00 4 2018-10-28 02:30:00+01:00 5 2018-10-28 03:00:00+01:00 6 2018-10-28 03:30:00+01:00 dtype: datetime64[ns, CET]
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly.
>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:20:00', ... '2018-10-28 02:36:00', ... '2018-10-28 03:46:00'])) >>> s.dt.tz_localize('CET', ambiguous=np.array([True, True, False])) 0 2018-10-28 01:20:00+02:00 1 2018-10-28 02:36:00+02:00 2 2018-10-28 03:46:00+01:00 dtype: datetime64[ns, CET]
If the DST transition causes nonexistent times, you can shift these timestamps forward or backward with a timedelta object, or with ‘shift_forward’ or ‘shift_backward’.
>>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00', ... '2015-03-29 03:30:00'])) >>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward') 0 2015-03-29 03:00:00+02:00 1 2015-03-29 03:30:00+02:00 dtype: datetime64[ns, Europe/Warsaw]
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_backward') 0 2015-03-29 01:59:59.999999999+01:00 1 2015-03-29 03:30:00+02:00 dtype: datetime64[ns, Europe/Warsaw]
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H')) 0 2015-03-29 03:30:00+02:00 1 2015-03-29 03:30:00+02:00 dtype: datetime64[ns, Europe/Warsaw]
- to_period(*args, **kwargs)
Cast to PeriodArray/Index at a particular frequency.
Converts DatetimeArray/Index to PeriodArray/Index.
- Parameters:
freq (str or Offset, optional) – One of pandas’ offset strings or an Offset object. Will be inferred by default.
- Return type:
PeriodArray/Index
- Raises:
ValueError – When converting a DatetimeArray/Index with non-regular values, so that a frequency cannot be inferred.
See also
PeriodIndex: Immutable ndarray holding ordinal values.
DatetimeIndex.to_pydatetime: Return DatetimeIndex as object.
Examples
>>> df = pd.DataFrame({"y": [1, 2, 3]}, ... index=pd.to_datetime(["2000-03-31 00:00:00", ... "2000-05-31 00:00:00", ... "2000-08-31 00:00:00"])) >>> df.index.to_period("M") PeriodIndex(['2000-03', '2000-05', '2000-08'], dtype='period[M]')
Infer the daily frequency
>>> idx = pd.date_range("2017-01-01", periods=2) >>> idx.to_period() PeriodIndex(['2017-01-01', '2017-01-02'], dtype='period[D]')
- to_julian_date()[source]
Convert Datetime Array to float64 ndarray of Julian Dates. Julian date 0 is noon on January 1, 4713 BC. https://en.wikipedia.org/wiki/Julian_day
- Return type:
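Examples
An illustrative sketch (2000-01-01 12:00 is the J2000 epoch, Julian date 2451545.0; the exact return container varies by pandas version):
>>> idx = pd.DatetimeIndex(['2000-01-01 12:00'])
>>> idx.to_julian_date()
Index([2451545.0], dtype='float64')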
- isocalendar()[source]
Calculate year, week, and day according to the ISO 8601 standard.
New in version 1.1.0.
- Returns:
With columns year, week and day.
- Return type:
DataFrame
See also
Timestamp.isocalendar: Returns a 3-tuple containing ISO year, week number, and weekday for the given Timestamp object.
datetime.date.isocalendar: Return a named tuple object with three components: year, week and weekday.
Examples
>>> idx = pd.date_range(start='2019-12-29', freq='D', periods=4) >>> idx.isocalendar() year week day 2019-12-29 2019 52 7 2019-12-30 2020 1 1 2019-12-31 2020 1 2 2020-01-01 2020 1 3 >>> idx.isocalendar().week 2019-12-29 52 2019-12-30 1 2019-12-31 1 2020-01-01 1 Freq: D, Name: week, dtype: UInt32
- snap(freq='S')[source]
Snap time stamps to nearest occurring frequency.
- Return type:
- Parameters:
freq (Frequency) –
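Examples
An illustrative sketch (snapping daily timestamps to the nearest month start, 'MS'; not from the original docstring):
>>> idx = pd.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-02-01', '2023-02-02'])
>>> idx.snap('MS')
DatetimeIndex(['2023-01-01', '2023-01-01', '2023-02-01', '2023-02-01'], dtype='datetime64[ns]', freq=None)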
- slice_indexer(start=None, end=None, step=None)[source]
Return indexer for specified label slice. Index.slice_indexer, customized to handle time slicing.
In addition to functionality provided by Index.slice_indexer, does the following:
if both start and end are instances of datetime.time, it invokes indexer_between_time
if start and end are both either strings or None, perform value-based selection in non-monotonic cases.
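Examples
An illustrative sketch (the end label is included, so the stop position is one past it):
>>> idx = pd.date_range('2020-01-01', periods=5, freq='D')
>>> idx.slice_indexer('2020-01-02', '2020-01-04')
slice(1, 4, None)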
- indexer_at_time(time, asof=False)[source]
Return index locations of values at particular time of day.
- Parameters:
time (datetime.time or str) – Time passed in either as object (datetime.time) or as string in appropriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”, “%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).
asof (bool) –
- Return type:
np.ndarray[np.intp]
See also
indexer_between_time: Get index locations of values between particular times of day.
DataFrame.at_time: Select values at particular time of day.
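Examples
An illustrative sketch (not from the original docstring):
>>> idx = pd.DatetimeIndex(["2023-01-01 09:00", "2023-01-01 10:00", "2023-01-02 09:00"])
>>> idx.indexer_at_time("09:00")
array([0, 2])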
- indexer_between_time(start_time, end_time, include_start=True, include_end=True)[source]
Return index locations of values between particular times of day.
- Parameters:
start_time (datetime.time, str) – Time passed either as object (datetime.time) or as string in appropriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”, “%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).
end_time (datetime.time, str) – Time passed either as object (datetime.time) or as string in appropriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”, “%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).
include_start (bool, default True) –
include_end (bool, default True) –
- Return type:
np.ndarray[np.intp]
See also
indexer_at_time: Get index locations of values at particular time of day.
DataFrame.between_time: Select values between particular times of day.
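Examples
An illustrative sketch (both endpoints are included by default; only 10:00 falls in this window):
>>> idx = pd.DatetimeIndex(["2023-01-01 09:00", "2023-01-01 10:00", "2023-01-01 11:00"])
>>> idx.indexer_between_time("09:30", "10:30")
array([1])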
- as_unit(*args, **kwargs)
Convert to a dtype with the given unit resolution.
- Parameters:
unit ({'s', 'ms', 'us', 'ns'}) –
- Return type:
same type as self
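Examples
An illustrative sketch (the unit attribute is assumed available, as in pandas 2.0+):
>>> idx = pd.DatetimeIndex(['2020-01-02 01:02:03'])
>>> idx.unit
'ns'
>>> idx.as_unit('s')
DatetimeIndex(['2020-01-02 01:02:03'], dtype='datetime64[s]', freq=None)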
- ceil(*args, **kwargs)
Perform ceil operation on the data to the specified freq.
- Parameters:
freq (str or Offset) – The frequency level to ceil the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
Only relevant for DatetimeIndex:
‘infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
‘shift_forward’ will shift the nonexistent time forward to the closest existing time
‘shift_backward’ will shift the nonexistent time backward to the closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
- Return type:
- Raises:
ValueError – If the freq cannot be converted.
Notes
If the timestamps have a timezone, ceiling will take place relative to the local (“wall”) time and the result re-localized to the same timezone. When ceiling near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
DatetimeIndex
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min') >>> rng DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00', '2018-01-01 12:01:00'], dtype='datetime64[ns]', freq='T') >>> rng.ceil('H') DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00', '2018-01-01 13:00:00'], dtype='datetime64[ns]', freq=None)
Series
>>> pd.Series(rng).dt.ceil("H") 0 2018-01-01 12:00:00 1 2018-01-01 12:00:00 2 2018-01-01 13:00:00 dtype: datetime64[ns]
When rounding near a daylight savings time transition, use
ambiguous or nonexistent to control how the timestamp should be re-localized.
>>> rng_tz = pd.DatetimeIndex(["2021-10-31 01:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.ceil("H", ambiguous=False) DatetimeIndex(['2021-10-31 02:00:00+01:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.ceil("H", ambiguous=True) DatetimeIndex(['2021-10-31 02:00:00+02:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
- property date
Returns numpy array of python datetime.date objects.
Namely, the date part of Timestamps without time and timezone information.
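Examples
An illustrative sketch (not from the original docstring):
>>> idx = pd.date_range('2020-01-01 10:00', periods=2)
>>> idx.date
array([datetime.date(2020, 1, 1), datetime.date(2020, 1, 2)], dtype=object)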
- property day
The day of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="D") ... ) >>> datetime_series 0 2000-01-01 1 2000-01-02 2 2000-01-03 dtype: datetime64[ns] >>> datetime_series.dt.day 0 1 1 2 2 3 dtype: int32
- day_name(*args, **kwargs)
Return the day names with specified locale.
- Parameters:
locale (str, optional) – Locale determining the language in which to return the day name. Default is English locale (
'en_US.utf8'). Use the command locale -a on your terminal on Unix systems to find your locale language code.
- Returns:
Series or Index of day names.
- Return type:
Examples
>>> s = pd.Series(pd.date_range(start='2018-01-01', freq='D', periods=3)) >>> s 0 2018-01-01 1 2018-01-02 2 2018-01-03 dtype: datetime64[ns] >>> s.dt.day_name() 0 Monday 1 Tuesday 2 Wednesday dtype: object
>>> idx = pd.date_range(start='2018-01-01', freq='D', periods=3) >>> idx DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D') >>> idx.day_name() Index(['Monday', 'Tuesday', 'Wednesday'], dtype='object')
Using the locale parameter you can set a different locale language. For example, idx.day_name(locale='pt_BR.utf8') will return day names in Brazilian Portuguese.
>>> idx = pd.date_range(start='2018-01-01', freq='D', periods=3) >>> idx DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D') >>> idx.day_name(locale='pt_BR.utf8') Index(['Segunda', 'Terça', 'Quarta'], dtype='object')
- property day_of_week
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and on DatetimeIndex.
See also
Series.dt.dayofweek: Alias.
Series.dt.weekday: Alias.
Series.dt.day_name: Returns the name of the day of the week.
Examples
>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series() >>> s.dt.dayofweek 2016-12-31 5 2017-01-01 6 2017-01-02 0 2017-01-03 1 2017-01-04 2 2017-01-05 3 2017-01-06 4 2017-01-07 5 2017-01-08 6 Freq: D, dtype: int32
- property day_of_year
The ordinal day of the year.
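Examples
An illustrative sketch (int32 dtype per the version 2.0 change noted above; exact repr varies by pandas version):
>>> idx = pd.DatetimeIndex(['2020-01-01', '2020-02-01'])
>>> idx.day_of_year
Index([1, 32], dtype='int32')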
- property dayofweek
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and on DatetimeIndex.
See also
Series.dt.dayofweek: Alias.
Series.dt.weekday: Alias.
Series.dt.day_name: Returns the name of the day of the week.
Examples
>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series() >>> s.dt.dayofweek 2016-12-31 5 2017-01-01 6 2017-01-02 0 2017-01-03 1 2017-01-04 2 2017-01-05 3 2017-01-06 4 2017-01-07 5 2017-01-08 6 Freq: D, dtype: int32
- property dayofyear
The ordinal day of the year.
- property days_in_month
The number of days in the month.
- property daysinmonth
The number of days in the month.
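Examples
An illustrative sketch (2020 is a leap year, 2021 is not; exact repr varies by pandas version):
>>> idx = pd.DatetimeIndex(['2020-02-15', '2021-02-15'])
>>> idx.days_in_month
Index([29, 28], dtype='int32')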
- property dtype
The dtype for the DatetimeArray.
Warning
A future version of pandas will change dtype to never be a
numpy.dtype. Instead, DatetimeArray.dtype will always be an instance of an ExtensionDtype subclass.
- Returns:
If the values are tz-naive, then np.dtype('datetime64[ns]') is returned.
If the values are tz-aware, then the DatetimeTZDtype is returned.
- Return type:
numpy.dtype or DatetimeTZDtype
- floor(*args, **kwargs)
Perform floor operation on the data to the specified freq.
- Parameters:
freq (str or Offset) – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
Only relevant for DatetimeIndex:
‘infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
‘shift_forward’ will shift the nonexistent time forward to the closest existing time
‘shift_backward’ will shift the nonexistent time backward to the closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
- Return type:
- Raises:
ValueError – If the freq cannot be converted.
Notes
If the timestamps have a timezone, flooring will take place relative to the local (“wall”) time and the result re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
DatetimeIndex
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min') >>> rng DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00', '2018-01-01 12:01:00'], dtype='datetime64[ns]', freq='T') >>> rng.floor('H') DatetimeIndex(['2018-01-01 11:00:00', '2018-01-01 12:00:00', '2018-01-01 12:00:00'], dtype='datetime64[ns]', freq=None)
Series
>>> pd.Series(rng).dt.floor("H") 0 2018-01-01 11:00:00 1 2018-01-01 12:00:00 2 2018-01-01 12:00:00 dtype: datetime64[ns]
When rounding near a daylight savings time transition, use
ambiguous or nonexistent to control how the timestamp should be re-localized.
>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False) DatetimeIndex(['2021-10-31 02:00:00+01:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True) DatetimeIndex(['2021-10-31 02:00:00+02:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
- property hour
The hours of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="h") ... ) >>> datetime_series 0 2000-01-01 00:00:00 1 2000-01-01 01:00:00 2 2000-01-01 02:00:00 dtype: datetime64[ns] >>> datetime_series.dt.hour 0 0 1 1 2 2 dtype: int32
- property is_leap_year
Boolean indicator if the date belongs to a leap year.
A leap year is a year that has 366 days (instead of 365), including February 29 as an intercalary day. Leap years are years that are multiples of four, with the exception of years divisible by 100 but not by 400.
- Returns:
Booleans indicating if dates belong to a leap year.
- Return type:
Series or ndarray
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> idx = pd.date_range("2012-01-01", "2015-01-01", freq="Y") >>> idx DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31'], dtype='datetime64[ns]', freq='A-DEC') >>> idx.is_leap_year array([ True, False, False])
>>> dates_series = pd.Series(idx) >>> dates_series 0 2012-12-31 1 2013-12-31 2 2014-12-31 dtype: datetime64[ns] >>> dates_series.dt.is_leap_year 0 True 1 False 2 False dtype: bool
- property is_month_end
Indicates whether the date is the last day of the month.
- Returns:
For Series, returns a Series with boolean values. For DatetimeIndex, returns a boolean array.
- Return type:
Series or array
See also
is_month_start: Return a boolean indicating whether the date is the first day of the month.
is_month_end: Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3)) >>> s 0 2018-02-27 1 2018-02-28 2 2018-03-01 dtype: datetime64[ns] >>> s.dt.is_month_start 0 False 1 False 2 True dtype: bool >>> s.dt.is_month_end 0 False 1 True 2 False dtype: bool
>>> idx = pd.date_range("2018-02-27", periods=3) >>> idx.is_month_start array([False, False, True]) >>> idx.is_month_end array([False, True, False])
- property is_month_start
Indicates whether the date is the first day of the month.
- Returns:
For Series, returns a Series with boolean values. For DatetimeIndex, returns a boolean array.
- Return type:
Series or array
See also
is_month_start: Return a boolean indicating whether the date is the first day of the month.
is_month_end: Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3)) >>> s 0 2018-02-27 1 2018-02-28 2 2018-03-01 dtype: datetime64[ns] >>> s.dt.is_month_start 0 False 1 False 2 True dtype: bool >>> s.dt.is_month_end 0 False 1 True 2 False dtype: bool
>>> idx = pd.date_range("2018-02-27", periods=3) >>> idx.is_month_start array([False, False, True]) >>> idx.is_month_end array([False, True, False])
- is_normalized
Returns True if all of the dates are at midnight (“no time”).
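Examples
An illustrative sketch (not from the original docstring):
>>> pd.DatetimeIndex(['2020-01-01', '2020-01-02']).is_normalized
True
>>> pd.DatetimeIndex(['2020-01-01 10:00']).is_normalized
False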
- property is_quarter_end
Indicator for whether the date is the last day of a quarter.
- Returns:
is_quarter_end – The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.
- Return type:
See also
quarter: Return the quarter of the date.
is_quarter_start: Similar property indicating the quarter start.
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30", ... periods=4)}) >>> df.assign(quarter=df.dates.dt.quarter, ... is_quarter_end=df.dates.dt.is_quarter_end) dates quarter is_quarter_end 0 2017-03-30 1 False 1 2017-03-31 1 True 2 2017-04-01 2 False 3 2017-04-02 2 False
>>> idx = pd.date_range('2017-03-30', periods=4) >>> idx DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'], dtype='datetime64[ns]', freq='D')
>>> idx.is_quarter_end array([False, True, False, False])
- property is_quarter_start
Indicator for whether the date is the first day of a quarter.
- Returns:
is_quarter_start – The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.
- Return type:
See also
quarter: Return the quarter of the date.
is_quarter_end: Similar property for indicating the quarter end.
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30", ... periods=4)}) >>> df.assign(quarter=df.dates.dt.quarter, ... is_quarter_start=df.dates.dt.is_quarter_start) dates quarter is_quarter_start 0 2017-03-30 1 False 1 2017-03-31 1 False 2 2017-04-01 2 True 3 2017-04-02 2 False
>>> idx = pd.date_range('2017-03-30', periods=4) >>> idx DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'], dtype='datetime64[ns]', freq='D')
>>> idx.is_quarter_start array([False, False, True, False])
- property is_year_end
Indicate whether the date is the last day of the year.
- Returns:
The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.
- Return type:
See also
is_year_start: Similar property indicating the start of the year.
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3)) >>> dates 0 2017-12-30 1 2017-12-31 2 2018-01-01 dtype: datetime64[ns]
>>> dates.dt.is_year_end 0 False 1 True 2 False dtype: bool
>>> idx = pd.date_range("2017-12-30", periods=3) >>> idx DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'], dtype='datetime64[ns]', freq='D')
>>> idx.is_year_end array([False, True, False])
- property is_year_start
Indicate whether the date is the first day of a year.
- Returns:
The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.
- Return type:
See also
is_year_end: Similar property indicating the last day of the year.
Examples
This method is available on Series with datetime values under the
.dt accessor, and directly on DatetimeIndex.
>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3)) >>> dates 0 2017-12-30 1 2017-12-31 2 2018-01-01 dtype: datetime64[ns]
>>> dates.dt.is_year_start 0 False 1 False 2 True dtype: bool
>>> idx = pd.date_range("2017-12-30", periods=3) >>> idx DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'], dtype='datetime64[ns]', freq='D')
>>> idx.is_year_start array([False, False, True])
- property microsecond
The microseconds of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="us") ... ) >>> datetime_series 0 2000-01-01 00:00:00.000000 1 2000-01-01 00:00:00.000001 2 2000-01-01 00:00:00.000002 dtype: datetime64[ns] >>> datetime_series.dt.microsecond 0 0 1 1 2 2 dtype: int32
- property minute
The minutes of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="T") ... ) >>> datetime_series 0 2000-01-01 00:00:00 1 2000-01-01 00:01:00 2 2000-01-01 00:02:00 dtype: datetime64[ns] >>> datetime_series.dt.minute 0 0 1 1 2 2 dtype: int32
- property month
The month as January=1, December=12.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="M") ... ) >>> datetime_series 0 2000-01-31 1 2000-02-29 2 2000-03-31 dtype: datetime64[ns] >>> datetime_series.dt.month 0 1 1 2 2 3 dtype: int32
- month_name(*args, **kwargs)
Return the month names with specified locale.
- Parameters:
locale (str, optional) – Locale determining the language in which to return the month name. Default is English locale (
'en_US.utf8'). Use the command locale -a on your terminal on Unix systems to find your locale language code.
- Returns:
Series or Index of month names.
- Return type:
Examples
>>> s = pd.Series(pd.date_range(start='2018-01', freq='M', periods=3)) >>> s 0 2018-01-31 1 2018-02-28 2 2018-03-31 dtype: datetime64[ns] >>> s.dt.month_name() 0 January 1 February 2 March dtype: object
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3) >>> idx DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'], dtype='datetime64[ns]', freq='M') >>> idx.month_name() Index(['January', 'February', 'March'], dtype='object')
Using the locale parameter you can set a different locale language. For example, idx.month_name(locale='pt_BR.utf8') will return month names in Brazilian Portuguese.
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3) >>> idx DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'], dtype='datetime64[ns]', freq='M') >>> idx.month_name(locale='pt_BR.utf8') Index(['Janeiro', 'Fevereiro', 'Março'], dtype='object')
- property nanosecond
The nanoseconds of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="ns") ... ) >>> datetime_series 0 2000-01-01 00:00:00.000000000 1 2000-01-01 00:00:00.000000001 2 2000-01-01 00:00:00.000000002 dtype: datetime64[ns] >>> datetime_series.dt.nanosecond 0 0 1 1 2 2 dtype: int32
- normalize(*args, **kwargs)
Convert times to midnight.
The time component of the date-time is converted to midnight, i.e. 00:00:00. This is useful when the time does not matter. Length is unaltered. The timezones are unaffected.
This method is available on Series with datetime values under the
.dt accessor, and directly on Datetime Array/Index.
- Returns:
The same type as the original data. Series will have the same name and index. DatetimeIndex will have the same name.
- Return type:
DatetimeArray, DatetimeIndex or Series
Examples
>>> idx = pd.date_range(start='2014-08-01 10:00', freq='H', ... periods=3, tz='Asia/Calcutta') >>> idx DatetimeIndex(['2014-08-01 10:00:00+05:30', '2014-08-01 11:00:00+05:30', '2014-08-01 12:00:00+05:30'], dtype='datetime64[ns, Asia/Calcutta]', freq='H') >>> idx.normalize() DatetimeIndex(['2014-08-01 00:00:00+05:30', '2014-08-01 00:00:00+05:30', '2014-08-01 00:00:00+05:30'], dtype='datetime64[ns, Asia/Calcutta]', freq=None)
- property quarter
The quarter of the date.
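Examples
An illustrative sketch (exact repr varies by pandas version):
>>> idx = pd.DatetimeIndex(['2020-01-15', '2020-07-15'])
>>> idx.quarter
Index([1, 3], dtype='int32')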
- round(*args, **kwargs)
Perform round operation on the data to the specified freq.
- Parameters:
freq (str or Offset) – The frequency level to round the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
Only relevant for DatetimeIndex:
‘infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
‘shift_forward’ will shift the nonexistent time forward to the closest existing time
‘shift_backward’ will shift the nonexistent time backward to the closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
- Return type:
- Raises:
ValueError – If the freq cannot be converted.
Notes
If the timestamps have a timezone, rounding will take place relative to the local (“wall”) time and the result re-localized to the same timezone. When rounding near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
DatetimeIndex
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min') >>> rng DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00', '2018-01-01 12:01:00'], dtype='datetime64[ns]', freq='T') >>> rng.round('H') DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00', '2018-01-01 12:00:00'], dtype='datetime64[ns]', freq=None)
Series
>>> pd.Series(rng).dt.round("H") 0 2018-01-01 12:00:00 1 2018-01-01 12:00:00 2 2018-01-01 12:00:00 dtype: datetime64[ns]
When rounding near a daylight savings time transition, use
ambiguous or nonexistent to control how the timestamp should be re-localized.
>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False) DatetimeIndex(['2021-10-31 02:00:00+01:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True) DatetimeIndex(['2021-10-31 02:00:00+02:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
- property second
The seconds of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="s") ... ) >>> datetime_series 0 2000-01-01 00:00:00 1 2000-01-01 00:00:01 2 2000-01-01 00:00:02 dtype: datetime64[ns] >>> datetime_series.dt.second 0 0 1 1 2 2 dtype: int32
- std(*args, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis (int optional, default None) – Axis for the function to be applied on. For Series this parameter is unused and defaults to None.
ddof (int, default 1) – Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- Return type:
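Examples
An illustrative sketch (for datetime data the result is a Timedelta; this example is not from the original docstring):
>>> idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'])
>>> idx.std()
Timedelta('1 days 00:00:00')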
- property time
Returns numpy array of
datetime.time objects.
The time part of the Timestamps.
- property timetz
Returns numpy array of
datetime.time objects with timezones.
The time part of the Timestamps.
- to_pydatetime(*args, **kwargs)
Return an ndarray of datetime.datetime objects.
- Return type:
numpy.ndarray
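Examples
An illustrative sketch (output shown for a tz-naive index; not from the original docstring):
>>> idx = pd.date_range('2020-01-01', periods=2)
>>> idx.to_pydatetime()
array([datetime.datetime(2020, 1, 1, 0, 0),
       datetime.datetime(2020, 1, 2, 0, 0)], dtype=object)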
- property tzinfo
Alias for the tz attribute.
- property weekday
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and on DatetimeIndex.
See also
Series.dt.dayofweek: Alias.
Series.dt.weekday: Alias.
Series.dt.day_name: Returns the name of the day of the week.
Examples
>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series() >>> s.dt.dayofweek 2016-12-31 5 2017-01-01 6 2017-01-02 0 2017-01-03 1 2017-01-04 2 2017-01-05 3 2017-01-06 4 2017-01-07 5 2017-01-08 6 Freq: D, dtype: int32
- property year
The year of the datetime.
Examples
>>> datetime_series = pd.Series( ... pd.date_range("2000-01-01", periods=3, freq="Y") ... ) >>> datetime_series 0 2000-12-31 1 2001-12-31 2 2002-12-31 dtype: datetime64[ns] >>> datetime_series.dt.year 0 2000 1 2001 2 2002 dtype: int32
- class pandas.DatetimeTZDtype[source]
An ExtensionDtype for timezone-aware datetime data.
This is not an actual numpy dtype, but a duck type.
- Parameters:
unit (str, default "ns") – The precision of the datetime data. Currently limited to "ns".
tz (str, int, or datetime.tzinfo) – The timezone.
- unit
- tz
- Raises:
pytz.UnknownTimeZoneError – When the requested timezone cannot be found.
- Parameters:
unit (str_type | DatetimeTZDtype) –
Examples
>>> pd.DatetimeTZDtype(tz='UTC') datetime64[ns, UTC]
>>> pd.DatetimeTZDtype(tz='dateutil/US/Central') datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
- num = 101
- property na_value: NaTType
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- property tz: tzinfo
The timezone.
- classmethod construct_array_type()[source]
Return the array type associated with this dtype.
- Return type:
- classmethod construct_from_string(string)[source]
Construct a DatetimeTZDtype from a string.
- Parameters:
string (str) – The string alias for this DatetimeTZDtype. Should be formatted like datetime64[ns, <tz>], where <tz> is the timezone name.
- Return type:
Examples
>>> DatetimeTZDtype.construct_from_string('datetime64[ns, UTC]') datetime64[ns, UTC]
- class pandas.ExcelFile[source]
Class for parsing tabular Excel sheets into DataFrame objects.
See read_excel for more documentation.
- Parameters:
path_or_buffer (str, bytes, path object (pathlib.Path or py._path.local.LocalPath),) – A file-like object, xlrd workbook or openpyxl workbook. If a string or path object, expected to be a path to a .xls, .xlsx, .xlsb, .xlsm, .odf, .ods, or .odt file.
engine (str, default None) –
If io is not a buffer or path, this must be set to identify io. Supported engines: xlrd, openpyxl, odf, pyxlsb.
Engine compatibility:
xlrd supports old-style Excel files (.xls).
openpyxl supports newer Excel file formats.
odf supports OpenDocument file formats (.odf, .ods, .odt).
pyxlsb supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if path_or_buffer is in xlsb format, pyxlsb will be used.
New in version 1.3.0.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Warning
Please do not report issues when using xlrd to read .xlsx files. This is not supported; switch to using openpyxl instead.
storage_options (StorageOptions) –
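Examples
A typical usage sketch (the file name here is illustrative):
>>> with pd.ExcelFile('path_to_file.xlsx') as xls:
...     df = pd.read_excel(xls, 'Sheet1')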
- class ODFReader
- Parameters:
filepath_or_buffer (FilePath | ReadBuffer[bytes]) –
storage_options (StorageOptions) –
- get_sheet_data(sheet, file_rows_needed=None)
Parse an ODF Table into a list of lists
- class OpenpyxlReader
- Parameters:
filepath_or_buffer (FilePath | ReadBuffer[bytes]) –
storage_options (StorageOptions) –
- get_sheet_data(sheet, file_rows_needed=None)
- class PyxlsbReader
- Parameters:
filepath_or_buffer (FilePath | ReadBuffer[bytes]) –
storage_options (StorageOptions) –
- get_sheet_data(sheet, file_rows_needed=None)
- class XlrdReader
- Parameters:
storage_options (StorageOptions) –
- get_sheet_by_index(index)
- get_sheet_by_name(name)
- get_sheet_data(sheet, file_rows_needed=None)
- load_workbook(filepath_or_buffer)
- property sheet_names
- parse(sheet_name=0, header=0, names=None, index_col=None, usecols=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, parse_dates=False, date_parser=_NoDefault.no_default, date_format=None, thousands=None, comment=None, skipfooter=0, dtype_backend=_NoDefault.no_default, **kwds)[source]
Parse specified sheet(s) into a DataFrame.
Equivalent to read_excel(ExcelFile, …). See the read_excel docstring for more info on accepted parameters.
- Returns:
DataFrame from the passed in Excel file.
- Return type:
- Parameters:
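Examples
A short sketch (illustrative file name; not from the original docstring):
>>> xls = pd.ExcelFile('path_to_file.xlsx')
>>> df = xls.parse()  # first sheet as a DataFrame
>>> sheets = xls.parse(sheet_name=None)  # all sheets as a dict of DataFrames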
- property book
- property sheet_names
- class pandas.ExcelWriter[source]
Class for writing DataFrame objects into excel sheets.
Default is to use:
xlsxwriter for xlsx files if xlsxwriter is installed, otherwise openpyxl
odswriter for ods files
See DataFrame.to_excel for typical usage.
The writer should be used as a context manager. Otherwise, call close() to save and close any opened file handles.
- Parameters:
engine (str (optional)) – Engine to use for writing. If None, defaults to io.excel.<extension>.writer. NOTE: can only be passed as a keyword argument.
date_format (str, default None) – Format string for dates written into Excel files (e.g. ‘YYYY-MM-DD’).
datetime_format (str, default None) – Format string for datetime objects written into Excel files. (e.g. ‘YYYY-MM-DD HH:MM:SS’).
mode ({'w', 'a'}, default 'w') – File mode to use (write or append). Append does not work with fsspec URLs.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer to the pandas documentation.
New in version 1.2.0.
if_sheet_exists ({'error', 'new', 'replace', 'overlay'}, default 'error') –
How to behave when trying to write to a sheet that already exists (append mode only).
error: raise a ValueError.
new: Create a new sheet, with a name determined by the engine.
replace: Delete the contents of the sheet before writing to it.
overlay: Write contents to the existing sheet without removing the old contents.
New in version 1.3.0.
Changed in version 1.4.0: Added the overlay option.
engine_kwargs (dict, optional) –
Keyword arguments to be passed into the engine. These will be passed to the following functions of the respective engines:
xlsxwriter: xlsxwriter.Workbook(file, **engine_kwargs)
openpyxl (write mode): openpyxl.Workbook(**engine_kwargs)
openpyxl (append mode): openpyxl.load_workbook(file, **engine_kwargs)
odswriter: odf.opendocument.OpenDocumentSpreadsheet(**engine_kwargs)
New in version 1.3.0.
- Return type:
Notes
For compatibility with CSV writers, ExcelWriter serializes lists and dicts to strings before writing.
Examples
Default usage:
>>> df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"]) >>> with pd.ExcelWriter("path_to_file.xlsx") as writer: ... df.to_excel(writer)
To write to separate sheets in a single file:
>>> df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"]) >>> df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"]) >>> with pd.ExcelWriter("path_to_file.xlsx") as writer: ... df1.to_excel(writer, sheet_name="Sheet1") ... df2.to_excel(writer, sheet_name="Sheet2")
You can set the date format or datetime format:
>>> from datetime import date, datetime >>> df = pd.DataFrame( ... [ ... [date(2014, 1, 31), date(1999, 9, 24)], ... [datetime(1998, 5, 26, 23, 33, 4), datetime(2014, 2, 28, 13, 5, 13)], ... ], ... index=["Date", "Datetime"], ... columns=["X", "Y"], ... ) >>> with pd.ExcelWriter( ... "path_to_file.xlsx", ... date_format="YYYY-MM-DD", ... datetime_format="YYYY-MM-DD HH:MM:SS" ... ) as writer: ... df.to_excel(writer)
You can also append to an existing Excel file:
>>> with pd.ExcelWriter("path_to_file.xlsx", mode="a", engine="openpyxl") as writer: ... df.to_excel(writer, sheet_name="Sheet3")
Here, the if_sheet_exists parameter can be set to replace a sheet if it already exists:
>>> with ExcelWriter( ... "path_to_file.xlsx", ... mode="a", ... engine="openpyxl", ... if_sheet_exists="replace", ... ) as writer: ... df.to_excel(writer, sheet_name="Sheet1")
You can also write multiple DataFrames to a single sheet. Note that the if_sheet_exists parameter needs to be set to overlay:
>>> with ExcelWriter("path_to_file.xlsx", ... mode="a", ... engine="openpyxl", ... if_sheet_exists="overlay", ... ) as writer: ... df1.to_excel(writer, sheet_name="Sheet1") ... df2.to_excel(writer, sheet_name="Sheet1", startcol=3)
You can store the Excel file in RAM:
>>> import io >>> df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"]) >>> buffer = io.BytesIO() >>> with pd.ExcelWriter(buffer) as writer: ... df.to_excel(writer)
You can pack the Excel file into a zip archive:
>>> import zipfile >>> df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"]) >>> with zipfile.ZipFile("path_to_file.zip", "w") as zf: ... with zf.open("filename.xlsx", "w") as buffer: ... with pd.ExcelWriter(buffer) as writer: ... df.to_excel(writer)
You can specify additional arguments to the underlying engine:
>>> with pd.ExcelWriter( ... "path_to_file.xlsx", ... engine="xlsxwriter", ... engine_kwargs={"options": {"nan_inf_to_errors": True}} ... ) as writer: ... df.to_excel(writer)
In append mode, engine_kwargs are passed through to openpyxl’s load_workbook:
>>> with pd.ExcelWriter( ... "path_to_file.xlsx", ... engine="openpyxl", ... mode="a", ... engine_kwargs={"keep_vba": True} ... ) as writer: ... df.to_excel(writer, sheet_name="Sheet2")
- abstract property book
Book instance. Class type will depend on the engine used.
This attribute can be used to access engine-specific features.
- property datetime_format: str
Format string for dates written into Excel files (e.g. ‘YYYY-MM-DD’).
- property if_sheet_exists: str
How to behave when writing to a sheet that already exists in append mode.
- class pandas.Flags[source]
- property allows_duplicate_labels: bool
Whether this object allows duplicate labels.
Setting allows_duplicate_labels=False ensures that the index (and columns of a DataFrame) are unique. Most methods that accept and return a Series or DataFrame will propagate the value of allows_duplicate_labels.
See duplicates for more.
See also
DataFrame.attrs : Set global metadata on this object.
DataFrame.set_flags : Set global flags on this object.
Examples
>>> df = pd.DataFrame({"A": [1, 2]}, index=['a', 'a']) >>> df.flags.allows_duplicate_labels True >>> df.flags.allows_duplicate_labels = False Traceback (most recent call last): ... pandas.errors.DuplicateLabelError: Index has duplicates. positions label a [0, 1]
- class pandas.Float32Dtype[source]
An ExtensionDtype for float32 data.
This dtype uses pd.NA as missing value indicator.
- type
alias of
float32
- class pandas.Float64Dtype[source]
An ExtensionDtype for float64 data.
This dtype uses pd.NA as missing value indicator.
- type
alias of
float64
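Examples
A minimal construction sketch (added here; exact reprs may differ slightly across pandas versions):
>>> pd.Float64Dtype()
Float64Dtype()
>>> pd.array([0.1, None], dtype="Float64")
<FloatingArray>
[0.1, <NA>]
Length: 2, dtype: Float64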
- class pandas.Grouper[source]
A Grouper allows the user to specify a groupby instruction for an object.
This specification will select a column via the key parameter, or if the level and/or axis parameters are given, a level of the index of the target object.
If axis and/or level are passed as keywords to both Grouper and groupby, the values passed to Grouper take precedence.
- Parameters:
key (str, defaults to None) – Groupby key, which selects the grouping column of the target.
level (name/number, defaults to None) – The level for the target index.
freq (str / frequency object, defaults to None) –
This will groupby the specified frequency if the target selection (via key or level) is a datetime-like object. For full specification of available frequencies, please see here.
sort (bool, default False) – Whether to sort the resulting labels.
closed ({'left' or 'right'}) – Closed end of interval. Only when freq parameter is passed.
label ({'left' or 'right'}) – Interval boundary to use for labeling. Only when freq parameter is passed.
convention ({'start', 'end', 'e', 's'}) – If grouper is PeriodIndex and freq parameter is passed.
origin (Timestamp or str, default 'start_day') –
The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:
'epoch': origin is 1970-01-01
'start': origin is the first value of the timeseries
'start_day': origin is the first day at midnight of the timeseries
New in version 1.1.0.
'end': origin is the last value of the timeseries
'end_day': origin is the ceiling midnight of the last day
New in version 1.3.0.
offset (Timedelta or str, default is None) –
An offset timedelta added to the origin.
New in version 1.1.0.
dropna (bool, default True) –
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
New in version 1.2.0.
- Return type:
A specification for a groupby instruction
Examples
Syntactic sugar for df.groupby('A'):
>>> df = pd.DataFrame( ... { ... "Animal": ["Falcon", "Parrot", "Falcon", "Falcon", "Parrot"], ... "Speed": [100, 5, 200, 300, 15], ... } ... ) >>> df Animal Speed 0 Falcon 100 1 Parrot 5 2 Falcon 200 3 Falcon 300 4 Parrot 15 >>> df.groupby(pd.Grouper(key="Animal")).mean() Speed Animal Falcon 200.0 Parrot 10.0
Specify a resample operation on the column ‘Publish date’
>>> df = pd.DataFrame( ... { ... "Publish date": [ ... pd.Timestamp("2000-01-02"), ... pd.Timestamp("2000-01-02"), ... pd.Timestamp("2000-01-09"), ... pd.Timestamp("2000-01-16") ... ], ... "ID": [0, 1, 2, 3], ... "Price": [10, 20, 30, 40] ... } ... ) >>> df Publish date ID Price 0 2000-01-02 0 10 1 2000-01-02 1 20 2 2000-01-09 2 30 3 2000-01-16 3 40 >>> df.groupby(pd.Grouper(key="Publish date", freq="1W")).mean() ID Price Publish date 2000-01-02 0.5 15.0 2000-01-09 2.0 30.0 2000-01-16 3.0 40.0
If you want to adjust the start of the bins based on a fixed timestamp:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00' >>> rng = pd.date_range(start, end, freq='7min') >>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng) >>> ts 2000-10-01 23:30:00 0 2000-10-01 23:37:00 3 2000-10-01 23:44:00 6 2000-10-01 23:51:00 9 2000-10-01 23:58:00 12 2000-10-02 00:05:00 15 2000-10-02 00:12:00 18 2000-10-02 00:19:00 21 2000-10-02 00:26:00 24 Freq: 7T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min')).sum() 2000-10-01 23:14:00 0 2000-10-01 23:31:00 9 2000-10-01 23:48:00 21 2000-10-02 00:05:00 54 2000-10-02 00:22:00 24 Freq: 17T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min', origin='epoch')).sum() 2000-10-01 23:18:00 0 2000-10-01 23:35:00 18 2000-10-01 23:52:00 27 2000-10-02 00:09:00 39 2000-10-02 00:26:00 24 Freq: 17T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min', origin='2000-01-01')).sum() 2000-10-01 23:24:00 3 2000-10-01 23:41:00 15 2000-10-01 23:58:00 45 2000-10-02 00:15:00 45 Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
>>> ts.groupby(pd.Grouper(freq='17min', origin='start')).sum() 2000-10-01 23:30:00 9 2000-10-01 23:47:00 21 2000-10-02 00:04:00 54 2000-10-02 00:21:00 24 Freq: 17T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min', offset='23h30min')).sum() 2000-10-01 23:30:00 9 2000-10-01 23:47:00 21 2000-10-02 00:04:00 54 2000-10-02 00:21:00 24 Freq: 17T, dtype: int64
To replace the use of the deprecated base argument, you can now use offset; in this example it is equivalent to base=2:
>>> ts.groupby(pd.Grouper(freq='17min', offset='2min')).sum() 2000-10-01 23:16:00 0 2000-10-01 23:33:00 9 2000-10-01 23:50:00 36 2000-10-02 00:07:00 39 2000-10-02 00:24:00 24 Freq: 17T, dtype: int64
- property indexer
- property obj
- property grouper
- property groups
- class pandas.HDFStore[source]
Dict-like IO interface for storing pandas objects in PyTables.
Either Fixed or Table format.
Warning
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- Parameters:
path (str) – File path to HDF5 file.
mode ({'a', 'w', 'r', 'r+'}, default 'a') –
'r' : Read-only; no data can be modified.
'w' : Write; a new file is created (an existing file with the same name would be deleted).
'a' : Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
'r+' : It is similar to 'a', but the file must already exist.
complevel (int, 0-9, default None) – Specifies a compression level for data. A value of 0 or None disables compression.
complib ({'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib') –
Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}.
Specifying a compression library which is not available raises a ValueError.
fletcher32 (bool, default False) – If applying compression use the fletcher32 checksum.
**kwargs – These parameters will be passed to the PyTables open_file method.
Examples
>>> bar = pd.DataFrame(np.random.randn(10, 4)) >>> store = pd.HDFStore('test.h5') >>> store['foo'] = bar # write to HDF5 >>> bar = store['foo'] # retrieve >>> store.close()
Create or load an HDF5 file in-memory
When passing the driver option to the PyTables open_file method through **kwargs, the HDF5 file is loaded or created in-memory and will only be written when closed:
>>> bar = pd.DataFrame(np.random.randn(10, 4)) >>> store = pd.HDFStore('test.h5', driver='H5FD_CORE') >>> store['foo'] = bar >>> store.close() # only now, data is written to disk
- property root
Return the root node.
- keys(include='pandas')[source]
Return a list of keys corresponding to objects stored in HDFStore.
- Parameters:
include (str, default 'pandas') –
When include equals ‘pandas’ return pandas objects. When include equals ‘native’ return native HDF5 Table objects.
New in version 1.1.0.
- Returns:
List of ABSOLUTE path-names (e.g. have the leading ‘/’).
- Return type:
- Raises:
ValueError – if include has an illegal value
- open(mode='a', **kwargs)[source]
Open the file in the specified mode.
- Parameters:
mode ({'a', 'w', 'r', 'r+'}, default 'a') – See HDFStore docstring or tables.open_file for info about modes
**kwargs – These parameters will be passed to the PyTables open_file method.
- Return type:
None
- flush(fsync=False)[source]
Force all buffered modifications to be written to disk.
- Parameters:
fsync (bool (default False)) – Call os.fsync() on the file handle to force writing to disk.
- Return type:
None
Notes
Without fsync=True, flushing may not guarantee that the OS writes to disk. With fsync, the operation will block until the OS claims the file has been written; however, other caching layers may still interfere.
- select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)[source]
Retrieve pandas object stored in file, optionally based on where criteria.
Warning
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- Parameters:
key (str) – Object being retrieved from file.
where (list or None) – List of Term (or convertible) objects, optional.
start (int or None) – Row number to start selection.
stop (int, default None) – Row number to stop selection.
columns (list or None) – A list of columns that if not None, will limit the return columns.
iterator (bool, default False) – Return an iterator.
chunksize (int or None) – Number of rows to include in each iteration; returns an iterator.
auto_close (bool, default False) – Whether to automatically close the store when finished.
- Returns:
Retrieved object from file.
- Return type:
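Examples
A hedged usage sketch (illustrative file and key names): select requires the object to have been stored in the queryable 'table' format.
>>> df = pd.DataFrame({'A': range(5)})
>>> store = pd.HDFStore('test.h5')
>>> store.put('df', df, format='table')
>>> store.select('df', where='index >= 3')
   A
3  3
4  4
>>> store.close()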
- select_as_coordinates(key, where=None, start=None, stop=None)[source]
Return the selection as an Index.
Warning
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- select_column(key, column, start=None, stop=None)[source]
Return a single column from the table. This is generally only useful to select an indexable column.
Warning
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- Parameters:
- Raises:
KeyError – if the column is not found (or key is not a valid store)
ValueError – if the column cannot be extracted individually (it is part of a data block)
- select_as_multiple(keys, where=None, selector=None, columns=None, start=None, stop=None, iterator=False, chunksize=None, auto_close=False)[source]
Retrieve pandas objects from multiple tables.
Warning
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- Parameters:
keys (list) – The tables to retrieve.
selector (str, default keys[0] if not supplied) – The table to which the where criteria are applied.
columns (list) – The columns to return.
start (int, default None) – Row number to start selection.
stop (int, default None) – Row number to stop selection.
iterator (bool, default False) – Return an iterator.
chunksize (int or None) – Number of rows to include in each iteration; returns an iterator.
auto_close (bool, default False) – Should automatically close the store when finished.
- Raises:
KeyError – if keys or selector is not found, or keys is empty
TypeError – if keys is not a list or tuple
ValueError – if the tables do not all have the same dimensions
- put(key, value, format=None, index=True, append=False, complib=None, complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict', track_times=True, dropna=False)[source]
Store object in HDFStore.
- Parameters:
key (str) –
value ({Series, DataFrame}) –
format ('fixed(f)|table(t)', default is 'fixed') –
Format to use when storing object in HDFStore. Value can be one of:
'fixed' : Fixed format. Fast writing/reading. Not appendable, nor searchable.
'table' : Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.
index (bool, default True) – Write DataFrame index as a column.
append (bool, default False) – This will force table format and append the input data to the existing table.
data_columns (list of columns or True, default None) – List of columns to create as data columns, or True to use all columns. See here.
encoding (str, default None) – Provide an encoding for strings.
track_times (bool, default True) – Parameter is propagated to the ‘create_table’ method of ‘PyTables’. If set to False it makes it possible to have the same h5 files (same hashes) independent of creation time.
dropna (bool, default False, optional) –
Remove missing values.
New in version 1.1.0.
complevel (int | None) –
errors (str) –
- Return type:
None
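Examples
A short sketch (illustrative file and key names):
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
>>> store = pd.HDFStore('store.h5')
>>> store.put('data', df, format='table')  # queryable, appendable format
>>> store.get('data')
   A  B
0  1  2
1  3  4
>>> store.close()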
- remove(key, where=None, start=None, stop=None)[source]
Remove a stored pandas object, or part of it by specifying the where condition.
- Parameters:
- Return type:
number of rows removed (or None if not a Table)
- Raises:
KeyError – if key is not a valid store
- append(key, value, format=None, axes=None, index=True, append=True, complib=None, complevel=None, columns=None, min_itemsize=None, nan_rep=None, chunksize=None, expectedrows=None, dropna=None, data_columns=None, encoding=None, errors='strict')[source]
Append to Table in file.
Node must already exist and be Table format.
- Parameters:
key (str) –
value ({Series, DataFrame}) –
format ('table' is the default) –
Format to use when storing object in HDFStore. Value can be one of:
'table' : Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.
index (bool, default True) – Write DataFrame index as a column.
append (bool, default True) – Append the input data to the existing.
data_columns (list of columns, or True, default None) – List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See here.
min_itemsize (dict) – Dict of columns that specify minimum str sizes.
nan_rep (str) – Str to use as the NaN representation for strings.
chunksize (int) – Size to chunk the writing.
expectedrows (int) – Expected total number of rows of this table.
encoding (str, default None) – Provide an encoding for str.
dropna (bool, default False, optional) – Do not write an ALL nan row to the store settable by the option ‘io.hdf.dropna_table’.
complevel (int | None) –
errors (str) –
- Return type:
None
Notes
Does not check if data being appended overlaps with existing data in the table, so be careful.
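Examples
A minimal append sketch (illustrative names); the node is created on the first append if it does not exist yet:
>>> store = pd.HDFStore('store.h5')
>>> df1 = pd.DataFrame({'A': [1, 2]})
>>> df2 = pd.DataFrame({'A': [3, 4]})
>>> store.append('df', df1)
>>> store.append('df', df2)  # rows are added to the existing table
>>> len(store.select('df'))
4
>>> store.close()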
- append_to_multiple(d, value, selector, data_columns=None, axes=None, dropna=False, **kwargs)[source]
Append to multiple tables
- Parameters:
d (dict) – A dict of table_name to table_columns. None is acceptable as the value of one node (this will get all the remaining columns).
value (a pandas object) –
selector (str) – A string that designates the indexable table; all of its columns will be designated as data_columns, unless data_columns is passed, in which case these are used.
data_columns (list or True) – List of columns to create as data columns, or True to use all columns.
dropna (bool, default False) – If it evaluates to True, drop rows from all tables if any single row in each table has all NaN.
- Return type:
None
Notes
The axes parameter is currently not accepted.
- create_table_index(key, columns=None, optlevel=None, kind=None)[source]
Create a pytables index on the table.
- Parameters:
key (str) –
columns (None, bool, or listlike[str]) –
Indicate which columns to create an index on.
False : Do not create any indexes.
True : Create indexes on all columns.
None : Create indexes on all columns.
listlike : Create indexes on the given columns.
optlevel (int or None, default None) – Optimization level, if None, pytables defaults to 6.
kind (str or None, default None) – Kind of index, if None, pytables defaults to “medium”.
- Raises:
TypeError – if the node is not a table
- Return type:
None
- groups()[source]
Return a list of all the top-level nodes.
Each node returned is not a pandas storage object.
- Returns:
List of objects.
- Return type:
- walk(where='/')[source]
Walk the pytables group hierarchy for pandas objects.
This generator will yield the group path, subgroups and pandas object names for each group.
Any non-pandas PyTables objects that are not a group will be ignored.
The where group itself is listed first (preorder), then each of its child groups (following an alphanumerical order) is also traversed, following the same procedure.
- Parameters:
where (str, default "/") – Group where to start walking.
- Yields:
path (str) – Full path to a group (without trailing ‘/’).
groups (list) – Names (strings) of the groups contained in path.
leaves (list) – Names (strings) of the pandas objects contained in path.
- Return type:
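Examples
A traversal sketch (illustrative keys), following the pattern from the pandas user guide:
>>> store = pd.HDFStore('walk.h5')
>>> store.put('/a/df1', pd.DataFrame({'x': [1]}))
>>> store.put('/b/df2', pd.DataFrame({'x': [2]}))
>>> for path, groups, leaves in store.walk():
...     for leaf in leaves:
...         print('/'.join([path, leaf]))
/a/df1
/b/df2
>>> store.close()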
- get_node(key)[source]
Return the node with the key, or None if it does not exist.
- Parameters:
key (str) –
- Return type:
Node | None
- get_storer(key)[source]
Return the storer object for a key; raise if not in the file.
- Parameters:
key (str) –
- Return type:
GenericFixed | Table
- copy(file, mode='w', propindexes=True, keys=None, complib=None, complevel=None, fletcher32=False, overwrite=True)[source]
Copy the existing store to a new file, updating in place.
- Parameters:
propindexes (bool, default True) – Restore indexes in copied file.
keys (list, optional) – List of keys to include in the copy (defaults to all).
overwrite (bool, default True) – Whether to overwrite (remove and replace) existing nodes in the new store.
mode (str) –
complib –
complevel (int | None) –
fletcher32 (bool) – Same as in HDFStore.__init__.
- Return type:
open file handle of the new store
- class pandas.Index[source]
Immutable sequence used for indexing and alignment.
The basic object storing axis labels for all pandas objects.
Changed in version 2.0.0: Index can hold all numpy numeric dtypes (except float16). Previously only int64/uint64/float64 dtypes were accepted.
- Parameters:
data (array-like (1-dimensional)) –
dtype (NumPy dtype (default: object)) – If dtype is None, we find the dtype that best fits the data. If an actual dtype is provided, we coerce to that dtype if it’s safe. Otherwise, an error will be raised.
copy (bool) – Make a copy of input ndarray.
name (object) – Name to be stored in the index.
tupleize_cols (bool (default: True)) – When True, attempt to create a MultiIndex if possible.
- Return type:
See also
RangeIndex : Index implementing a monotonic integer range.
CategoricalIndex : Index of Categoricals.
MultiIndex : A multi-level, or hierarchical Index.
IntervalIndex : An Index of Intervals.
DatetimeIndex : Index of datetime64 data.
TimedeltaIndex : Index of timedelta64 data.
PeriodIndex : Index of Period data.
Notes
An Index instance can only contain hashable objects. An Index instance cannot hold numpy float16 dtype.
Examples
>>> pd.Index([1, 2, 3]) Index([1, 2, 3], dtype='int64')
>>> pd.Index(list('abc')) Index(['a', 'b', 'c'], dtype='object')
>>> pd.Index([1, 2, 3], dtype="uint8") Index([1, 2, 3], dtype='uint8')
- str
alias of
StringMethods
- final is_(other)[source]
More flexible, faster check like is but that works through views.
Note: this is not the same as Index.identical(), which checks that metadata is also the same.
- Parameters:
other (object) – Other object to compare against.
- Returns:
True if both have same underlying data, False otherwise.
- Return type:
See also
Index.identical : Works like Index.is_ but also checks metadata.
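Examples
A brief illustration (added here; not from the source entry): views share the underlying data, copies do not.
>>> idx1 = pd.Index(['1'])
>>> idx1.is_(idx1.view())
True
>>> idx1.is_(idx1.copy())
False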
- dtype
Return the dtype object of the underlying data.
- final ravel(order='C')[source]
Return a view on self.
See also
numpy.ndarray.ravel : Return a flattened array.
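Examples
A one-line sketch (added here); in recent pandas versions ravel returns a view of the Index itself.
>>> idx = pd.Index([1, 2, 3])
>>> idx.ravel()
Index([1, 2, 3], dtype='int64')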
- astype(dtype, copy=True)[source]
Create an Index with values cast to dtypes.
The class of a new Index is determined by dtype. When conversion is impossible, a TypeError exception is raised.
- Parameters:
dtype (numpy dtype or pandas type) – Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size.
copy (bool, default True) – By default, astype always returns a newly allocated object. If copy is set to False and internal requirements on dtype are satisfied, the original data is used to create a new Index or the original Index is returned.
- Returns:
Index with values cast to specified dtype.
- Return type:
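Examples
A short illustration (added here):
>>> idx = pd.Index([1, 2, 3])
>>> idx.astype('float')
Index([1.0, 2.0, 3.0], dtype='float64')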
- take(indices, axis=0, allow_fill=True, fill_value=None, **kwargs)[source]
Return a new Index of the values selected by the indices.
For internal compatibility with numpy arrays.
- Parameters:
indices (array-like) – Indices to be taken.
axis (int, optional) – The axis over which to select values, always 0.
allow_fill (bool, default True) –
fill_value (scalar, default None) – If allow_fill=True and fill_value is not None, indices specified by -1 are regarded as NA. If Index doesn’t hold NA, raise ValueError.
- Returns:
An index formed of elements at the given indices. Will be the same type as self, except for RangeIndex.
- Return type:
See also
numpy.ndarray.take : Return an array formed from the elements of a at the given indices.
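Examples
A brief sketch (added here):
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.take([2, 2, 0])
Index(['c', 'c', 'a'], dtype='object')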
- repeat(repeats, axis=None)[source]
Repeat elements of a Index.
Returns a new Index where each element of the current Index is repeated consecutively a given number of times.
- Parameters:
repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty Index.
axis (None) – Must be None. Has no effect but is accepted for compatibility with numpy.
- Returns:
Newly created Index with repeated elements.
- Return type:
See also
Series.repeat : Equivalent function for Series.
numpy.repeat : Similar method for numpy.ndarray.
Examples
>>> idx = pd.Index(['a', 'b', 'c']) >>> idx Index(['a', 'b', 'c'], dtype='object') >>> idx.repeat(2) Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object') >>> idx.repeat([1, 2, 3]) Index(['a', 'b', 'b', 'c', 'c', 'c'], dtype='object')
- copy(name=None, deep=False)[source]
Make a copy of this object.
Name is set on the new object.
- Parameters:
name (Label, optional) – Set name for new object.
deep (bool, default False) –
self (_IndexT) –
- Returns:
Index refer to new object which is a copy of this object.
- Return type:
Notes
In most cases, there should be no functional difference from using deep, but if deep is passed it will attempt to deepcopy.
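Examples
A minimal illustration (added here): the copy is a new object with equal values.
>>> idx = pd.Index(['a', 'b', 'c'])
>>> new_idx = idx.copy()
>>> idx is new_idx
False
>>> idx.equals(new_idx)
True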
- format(name=False, formatter=None, na_rep='NaN')[source]
Render a string representation of the Index.
- to_flat_index()[source]
Identity method.
This is implemented for compatibility with subclass implementations when chaining.
- Returns:
Caller.
- Return type:
pd.Index
- Parameters:
self (_IndexT) –
See also
MultiIndex.to_flat_index : Subclass implementation.
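Examples
For a plain Index this is the identity (sketch added here):
>>> idx = pd.Index([1, 2, 3])
>>> idx.to_flat_index() is idx
True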
- final to_series(index=None, name=None)[source]
Create a Series with both index and values equal to the index keys.
Useful with map for returning an indexer based on an index.
- Parameters:
- Returns:
The dtype will be based on the type of the Index values.
- Return type:
See also
Index.to_frame : Convert an Index to a DataFrame.
Series.to_frame : Convert Series to DataFrame.
Examples
>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')
By default, the original Index and original name are reused.
>>> idx.to_series() animal Ant Ant Bear Bear Cow Cow Name: animal, dtype: object
To enforce a new Index, specify new labels to index:
>>> idx.to_series(index=[0, 1, 2]) 0 Ant 1 Bear 2 Cow Name: animal, dtype: object
To override the name of the resulting column, specify name:
>>> idx.to_series(name='zoo') animal Ant Ant Bear Bear Cow Cow Name: zoo, dtype: object
- to_frame(index=True, name=_NoDefault.no_default)[source]
Create a DataFrame with a column containing the Index.
- Parameters:
- Returns:
DataFrame containing the original Index data.
- Return type:
See also
Index.to_series : Convert an Index to a Series.
Series.to_frame : Convert Series to DataFrame.
Examples
>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal') >>> idx.to_frame() animal animal Ant Ant Bear Bear Cow Cow
By default, the original Index is reused. To enforce a new Index:
>>> idx.to_frame(index=False) animal 0 Ant 1 Bear 2 Cow
To override the name of the resulting column, specify name:
>>> idx.to_frame(index=False, name='zoo') zoo 0 Ant 1 Bear 2 Cow
- property names: FrozenList
- set_names(names, *, level=None, inplace: Literal[False] = False) _IndexT[source]
- set_names(names, *, level=None, inplace: Literal[True]) None
- set_names(names, *, level=None, inplace: bool = False) _IndexT | None
Set Index or MultiIndex name.
Able to set new names partially and by level.
- Parameters:
names (label or list of label or dict-like for MultiIndex) –
Name(s) to set.
Changed in version 1.3.0.
level (int, label or list of int or label, optional) –
If the index is a MultiIndex and names is not dict-like, level(s) to set (None for all levels). Otherwise level must be None.
Changed in version 1.3.0.
inplace (bool, default False) – Modifies the object directly, instead of creating a new Index or MultiIndex.
- Returns:
The same type as the caller or None if inplace=True.
- Return type:
Index or None
See also
Index.rename : Able to set new names without level.
Examples
>>> idx = pd.Index([1, 2, 3, 4]) >>> idx Index([1, 2, 3, 4], dtype='int64') >>> idx.set_names('quarter') Index([1, 2, 3, 4], dtype='int64', name='quarter')
>>> idx = pd.MultiIndex.from_product([['python', 'cobra'], ... [2018, 2019]]) >>> idx MultiIndex([('python', 2018), ('python', 2019), ( 'cobra', 2018), ( 'cobra', 2019)], ) >>> idx = idx.set_names(['kind', 'year']) >>> idx.set_names('species', level=0) MultiIndex([('python', 2018), ('python', 2019), ( 'cobra', 2018), ( 'cobra', 2019)], names=['species', 'year'])
When renaming levels with a dict, levels cannot be passed.
>>> idx.set_names({'kind': 'snake'}) MultiIndex([('python', 2018), ('python', 2019), ( 'cobra', 2018), ( 'cobra', 2019)], names=['snake', 'year'])
- rename(name, inplace=False)[source]
Alter Index or MultiIndex name.
Able to set new names without level. Defaults to returning new index. Length of names must match number of levels in MultiIndex.
- Parameters:
- Returns:
The same type as the caller or None if inplace=True.
- Return type:
Index or None
See also
Index.set_names : Able to set new names partially and by level.
Examples
>>> idx = pd.Index(['A', 'C', 'A', 'B'], name='score') >>> idx.rename('grade') Index(['A', 'C', 'A', 'B'], dtype='object', name='grade')
>>> idx = pd.MultiIndex.from_product([['python', 'cobra'], ... [2018, 2019]], ... names=['kind', 'year']) >>> idx MultiIndex([('python', 2018), ('python', 2019), ( 'cobra', 2018), ( 'cobra', 2019)], names=['kind', 'year']) >>> idx.rename(['species', 'year']) MultiIndex([('python', 2018), ('python', 2019), ( 'cobra', 2018), ( 'cobra', 2019)], names=['species', 'year']) >>> idx.rename('species') Traceback (most recent call last): TypeError: Must pass list-like as `names`.
- sortlevel(level=None, ascending=True, sort_remaining=None)[source]
For internal compatibility with the Index API.
Sort the Index. This is for compat with MultiIndex.
- get_level_values(level)
Return an Index of values for requested level.
This is primarily useful to get an individual level of values from a MultiIndex, but is provided on Index as well for compatibility.
- Parameters:
level (int or str) – It is either the integer position or the name of the level.
- Returns:
Calling object, as there is only one level in the Index.
- Return type:
See also
MultiIndex.get_level_values : Get values for a level of a MultiIndex.
Notes
For Index, level should be 0, since there are no multiple levels.
Examples
>>> idx = pd.Index(list('abc')) >>> idx Index(['a', 'b', 'c'], dtype='object')
Get level values by supplying level as integer:
>>> idx.get_level_values(0) Index(['a', 'b', 'c'], dtype='object')
- final droplevel(level=0)[source]
Return index with requested level(s) removed.
If resulting index has only 1 level left, the result will be of Index type, not MultiIndex. The original index is not modified inplace.
- Parameters:
level (int, str, or list-like, default 0) – If a string is given, must be the name of a level. If list-like, elements must be names or indexes of levels.
- Return type:
Index or MultiIndex
Examples
>>> mi = pd.MultiIndex.from_arrays( ... [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z']) >>> mi MultiIndex([(1, 3, 5), (2, 4, 6)], names=['x', 'y', 'z'])
>>> mi.droplevel() MultiIndex([(3, 5), (4, 6)], names=['y', 'z'])
>>> mi.droplevel(2) MultiIndex([(1, 3), (2, 4)], names=['x', 'y'])
>>> mi.droplevel('z') MultiIndex([(1, 3), (2, 4)], names=['x', 'y'])
>>> mi.droplevel(['x', 'y']) Index([5, 6], dtype='int64', name='z')
- property is_monotonic_increasing: bool
Return a boolean if the values are equal or increasing.
- Return type:
See also
Index.is_monotonic_decreasing : Check if the values are equal or decreasing.
Examples
>>> pd.Index([1, 2, 3]).is_monotonic_increasing True >>> pd.Index([1, 2, 2]).is_monotonic_increasing True >>> pd.Index([1, 3, 2]).is_monotonic_increasing False
- property is_monotonic_decreasing: bool
Return a boolean if the values are equal or decreasing.
- Return type:
See also
Index.is_monotonic_increasing : Check if the values are equal or increasing.
Examples
>>> pd.Index([3, 2, 1]).is_monotonic_decreasing True >>> pd.Index([3, 2, 2]).is_monotonic_decreasing True >>> pd.Index([3, 1, 2]).is_monotonic_decreasing False
- is_unique
Return if the index has unique values.
- Return type:
See also
Index.has_duplicates : Inverse method that checks if it has duplicate values.
Examples
>>> idx = pd.Index([1, 5, 7, 7]) >>> idx.is_unique False
>>> idx = pd.Index([1, 5, 7]) >>> idx.is_unique True
>>> idx = pd.Index(["Watermelon", "Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.is_unique False
>>> idx = pd.Index(["Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.is_unique True
- property has_duplicates: bool
Check if the Index has duplicate values.
- Returns:
Whether or not the Index has duplicate values.
- Return type:
See also
Index.is_unique : Inverse method that checks if it has unique values.
Examples
>>> idx = pd.Index([1, 5, 7, 7]) >>> idx.has_duplicates True
>>> idx = pd.Index([1, 5, 7]) >>> idx.has_duplicates False
>>> idx = pd.Index(["Watermelon", "Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.has_duplicates True
>>> idx = pd.Index(["Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.has_duplicates False
- final is_boolean()[source]
Check if the Index only consists of booleans.
Deprecated since version 2.0.0: Use pandas.api.types.is_bool_dtype instead.
- Returns:
Whether or not the Index only consists of booleans.
- Return type:
See also
is_integer : Check if the Index only consists of integers (deprecated).
is_floating : Check if the Index is a floating type (deprecated).
is_numeric : Check if the Index only consists of numeric data (deprecated).
is_object : Check if the Index is of the object dtype (deprecated).
is_categorical : Check if the Index holds categorical data.
is_interval : Check if the Index holds Interval objects (deprecated).
Examples
>>> idx = pd.Index([True, False, True]) >>> idx.is_boolean() True
>>> idx = pd.Index(["True", "False", "True"]) >>> idx.is_boolean() False
>>> idx = pd.Index([True, False, "True"]) >>> idx.is_boolean() False
- final is_integer()[source]
Check if the Index only consists of integers.
Deprecated since version 2.0.0: Use pandas.api.types.is_integer_dtype instead.
- Returns:
Whether or not the Index only consists of integers.
- Return type:
See also
is_boolean : Check if the Index only consists of booleans (deprecated).
is_floating : Check if the Index is a floating type (deprecated).
is_numeric : Check if the Index only consists of numeric data (deprecated).
is_object : Check if the Index is of the object dtype (deprecated).
is_categorical : Check if the Index holds categorical data (deprecated).
is_interval : Check if the Index holds Interval objects (deprecated).
Examples
>>> idx = pd.Index([1, 2, 3, 4]) >>> idx.is_integer() True
>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_integer() False
>>> idx = pd.Index(["Apple", "Mango", "Watermelon"]) >>> idx.is_integer() False
- final is_floating()[source]
Check if the Index is a floating type.
Deprecated since version 2.0.0: Use pandas.api.types.is_float_dtype instead.
The Index may consist of only floats, NaNs, or a mix of floats, integers, or NaNs.
- Returns:
Whether or not the Index only consists of floats, NaNs, or a mix of floats, integers, or NaNs.
- Return type:
See also
is_boolean : Check if the Index only consists of booleans (deprecated).
is_integer : Check if the Index only consists of integers (deprecated).
is_numeric : Check if the Index only consists of numeric data (deprecated).
is_object : Check if the Index is of the object dtype (deprecated).
is_categorical : Check if the Index holds categorical data (deprecated).
is_interval : Check if the Index holds Interval objects (deprecated).
Examples
>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_floating() True
>>> idx = pd.Index([1.0, 2.0, np.nan, 4.0]) >>> idx.is_floating() True
>>> idx = pd.Index([1, 2, 3, 4, np.nan]) >>> idx.is_floating() True
>>> idx = pd.Index([1, 2, 3, 4]) >>> idx.is_floating() False
- final is_numeric()[source]
Check if the Index only consists of numeric data.
Deprecated since version 2.0.0: Use pandas.api.types.is_numeric_dtype instead.
- Returns:
Whether or not the Index only consists of numeric data.
- Return type:
See also
is_boolean : Check if the Index only consists of booleans (deprecated).
is_integer : Check if the Index only consists of integers (deprecated).
is_floating : Check if the Index is a floating type (deprecated).
is_object : Check if the Index is of the object dtype (deprecated).
is_categorical : Check if the Index holds categorical data (deprecated).
is_interval : Check if the Index holds Interval objects (deprecated).
Examples
>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_numeric() True
>>> idx = pd.Index([1, 2, 3, 4.0]) >>> idx.is_numeric() True
>>> idx = pd.Index([1, 2, 3, 4]) >>> idx.is_numeric() True
>>> idx = pd.Index([1, 2, 3, 4.0, np.nan]) >>> idx.is_numeric() True
>>> idx = pd.Index([1, 2, 3, 4.0, np.nan, "Apple"]) >>> idx.is_numeric() False
- final is_object()[source]
Check if the Index is of the object dtype.
Deprecated since version 2.0.0: Use pandas.api.types.is_object_dtype instead.
- Returns:
Whether or not the Index is of the object dtype.
- Return type:
See also
is_boolean : Check if the Index only consists of booleans (deprecated).
is_integer : Check if the Index only consists of integers (deprecated).
is_floating : Check if the Index is a floating type (deprecated).
is_numeric : Check if the Index only consists of numeric data (deprecated).
is_categorical : Check if the Index holds categorical data (deprecated).
is_interval : Check if the Index holds Interval objects (deprecated).
Examples
>>> idx = pd.Index(["Apple", "Mango", "Watermelon"]) >>> idx.is_object() True
>>> idx = pd.Index(["Apple", "Mango", 2.0]) >>> idx.is_object() True
>>> idx = pd.Index(["Watermelon", "Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.is_object() False
>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_object() False
- final is_categorical()[source]
Check if the Index holds categorical data.
Deprecated since version 2.0.0: Use pandas.api.types.is_categorical_dtype() instead.
- Returns:
True if the Index is categorical.
- Return type:
See also
CategoricalIndex : Index for categorical data.
is_boolean : Check if the Index only consists of booleans (deprecated).
is_integer : Check if the Index only consists of integers (deprecated).
is_floating : Check if the Index is a floating type (deprecated).
is_numeric : Check if the Index only consists of numeric data (deprecated).
is_object : Check if the Index is of the object dtype (deprecated).
is_interval : Check if the Index holds Interval objects (deprecated).
Examples
>>> idx = pd.Index(["Watermelon", "Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.is_categorical() True
>>> idx = pd.Index([1, 3, 5, 7]) >>> idx.is_categorical() False
>>> s = pd.Series(["Peter", "Victor", "Elisabeth", "Mar"]) >>> s 0 Peter 1 Victor 2 Elisabeth 3 Mar dtype: object >>> s.index.is_categorical() False
- final is_interval()[source]
Check if the Index holds Interval objects.
Deprecated since version 2.0.0: Use pandas.api.types.is_interval_dtype instead.
- Returns:
Whether or not the Index holds Interval objects.
- Return type:
See also
IntervalIndex : Index for Interval objects.
is_boolean : Check if the Index only consists of booleans (deprecated).
is_integer : Check if the Index only consists of integers (deprecated).
is_floating : Check if the Index is a floating type (deprecated).
is_numeric : Check if the Index only consists of numeric data (deprecated).
is_object : Check if the Index is of the object dtype (deprecated).
is_categorical : Check if the Index holds categorical data (deprecated).
Examples
>>> idx = pd.Index([pd.Interval(left=0, right=5), ... pd.Interval(left=5, right=10)]) >>> idx.is_interval() True
>>> idx = pd.Index([1, 3, 5, 7]) >>> idx.is_interval() False
- final holds_integer()[source]
Whether the type is an integer type.
Deprecated since version 2.0.0: Use pandas.api.types.infer_dtype instead.
- Return type:
- inferred_type
Return a string of the type inferred from the values.
- final isna()[source]
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None, numpy.NaN or pd.NaT, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns:
A boolean array of whether my values are NA.
- Return type:
numpy.ndarray[bool]
See also
Index.notna : Boolean inverse of isna.
Index.dropna : Omit entries with missing values.
isna : Top-level isna.
Series.isna : Detect missing values in Series object.
Examples
Show which entries in a pandas.Index are NA. The result is an array.
>>> idx = pd.Index([5.2, 6.0, np.NaN]) >>> idx Index([5.2, 6.0, nan], dtype='float64') >>> idx.isna() array([False, False, True])
Empty strings are not considered NA values. None is considered an NA value.
>>> idx = pd.Index(['black', '', 'red', None]) >>> idx Index(['black', '', 'red', None], dtype='object') >>> idx.isna() array([False, False, False, True])
For datetimes, NaT (Not a Time) is considered as an NA value.
>>> idx = pd.DatetimeIndex([pd.Timestamp('1940-04-25'), ... pd.Timestamp(''), None, pd.NaT]) >>> idx DatetimeIndex(['1940-04-25', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None) >>> idx.isna() array([False, True, True, True])
- isnull()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None, numpy.NaN or pd.NaT, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns:
A boolean array of whether my values are NA.
- Return type:
numpy.ndarray[bool]
See also
Index.notna : Boolean inverse of isna.
Index.dropna : Omit entries with missing values.
isna : Top-level isna.
Series.isna : Detect missing values in Series object.
Examples
Show which entries in a pandas.Index are NA. The result is an array.
>>> idx = pd.Index([5.2, 6.0, np.NaN]) >>> idx Index([5.2, 6.0, nan], dtype='float64') >>> idx.isna() array([False, False, True])
Empty strings are not considered NA values. None is considered an NA value.
>>> idx = pd.Index(['black', '', 'red', None]) >>> idx Index(['black', '', 'red', None], dtype='object') >>> idx.isna() array([False, False, False, True])
For datetimes, NaT (Not a Time) is considered as an NA value.
>>> idx = pd.DatetimeIndex([pd.Timestamp('1940-04-25'), ... pd.Timestamp(''), None, pd.NaT]) >>> idx DatetimeIndex(['1940-04-25', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None) >>> idx.isna() array([False, True, True, True])
- final notna()[source]
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
- Returns:
Boolean array to indicate which entries are not NA.
- Return type:
numpy.ndarray[bool]
See also
Index.notnull : Alias of notna.
Index.isna : Inverse of notna.
notna : Top-level notna.
Examples
Show which entries in an Index are not NA. The result is an array.
>>> idx = pd.Index([5.2, 6.0, np.NaN]) >>> idx Index([5.2, 6.0, nan], dtype='float64') >>> idx.notna() array([ True, True, False])
Empty strings are not considered NA values. None is considered a NA value.
>>> idx = pd.Index(['black', '', 'red', None]) >>> idx Index(['black', '', 'red', None], dtype='object') >>> idx.notna() array([ True, True, True, False])
- notnull()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
- Returns:
Boolean array to indicate which entries are not NA.
- Return type:
numpy.ndarray[bool]
See also
Index.notnull : Alias of notna.
Index.isna : Inverse of notna.
notna : Top-level notna.
Examples
Show which entries in an Index are not NA. The result is an array.
>>> idx = pd.Index([5.2, 6.0, np.NaN]) >>> idx Index([5.2, 6.0, nan], dtype='float64') >>> idx.notna() array([ True, True, False])
Empty strings are not considered NA values. None is considered a NA value.
>>> idx = pd.Index(['black', '', 'red', None]) >>> idx Index(['black', '', 'red', None], dtype='object') >>> idx.notna() array([ True, True, True, False])
- fillna(value=None, downcast=None)[source]
Fill NA/NaN values with the specified value.
- Parameters:
value (scalar) – Scalar value to use to fill holes (e.g. 0). This value cannot be a list-like.
downcast (dict, default None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
- Return type:
See also
DataFrame.fillna : Fill NaN values of a DataFrame.
Series.fillna : Fill NaN values of a Series.
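Examples
A short sketch (added here):
>>> idx = pd.Index([np.nan, np.nan, 3])
>>> idx.fillna(2)
Index([2.0, 2.0, 3.0], dtype='float64')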
- dropna(how='any')[source]
Return Index without NA/NaN values.
- Parameters:
how ({'any', 'all'}, default 'any') – If the Index is a MultiIndex, drop the value when any or all levels are NaN.
self (_IndexT) –
- Return type:
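Examples
A short sketch (added here):
>>> idx = pd.Index([1, np.nan, 3])
>>> idx.dropna()
Index([1.0, 3.0], dtype='float64')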
- unique(level=None)[source]
Return unique values in the index.
Unique values are returned in order of appearance; this does NOT sort.
- Parameters:
level (int or hashable, optional) – Only return values from specified level (for MultiIndex). If int, gets the level by integer position, else by level name.
self (_IndexT) –
- Return type:
See also
unique : Numpy array of unique values in that column.
Series.unique : Return unique values of Series object.
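Examples
A short sketch (added here); values keep their order of first appearance:
>>> idx = pd.Index([1, 1, 2, 3, 3])
>>> idx.unique()
Index([1, 2, 3], dtype='int64')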
- drop_duplicates(*, keep='first')[source]
Return Index with duplicate values removed.
- Parameters:
keep ({'first', 'last', False}, default 'first') –
'first' : Drop duplicates except for the first occurrence.
'last' : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
self (_IndexT) –
- Return type:
See also
Series.drop_duplicates : Equivalent method on Series.
DataFrame.drop_duplicates : Equivalent method on DataFrame.
Index.duplicated : Related method on Index, indicating duplicate Index values.
Examples
Generate a pandas.Index with duplicate values.
>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'])
The keep parameter controls which duplicate values are removed. The value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’.
>>> idx.drop_duplicates(keep='first') Index(['lama', 'cow', 'beetle', 'hippo'], dtype='object')
The value ‘last’ keeps the last occurrence for each set of duplicated entries.
>>> idx.drop_duplicates(keep='last') Index(['cow', 'beetle', 'lama', 'hippo'], dtype='object')
The value
Falsediscards all sets of duplicated entries.>>> idx.drop_duplicates(keep=False) Index(['cow', 'beetle', 'hippo'], dtype='object')
- duplicated(keep='first')[source]
Indicate duplicate index values.
Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.
- Parameters:
keep ({'first', 'last', False}, default 'first') –
The value or values in a set of duplicates to mark as missing.
'first' : Mark duplicates as True except for the first occurrence.
'last' : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Return type:
np.ndarray[bool]
See also
Series.duplicated : Equivalent method on pandas.Series.
DataFrame.duplicated : Equivalent method on pandas.DataFrame.
Index.drop_duplicates : Remove duplicate values from Index.
Examples
By default, for each set of duplicated values, the first occurrence is set to False and all others to True:
>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama']) >>> idx.duplicated() array([False, False, True, False, True])
which is equivalent to
>>> idx.duplicated(keep='first') array([False, False, True, False, True])
By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:
>>> idx.duplicated(keep='last') array([ True, False, True, False, False])
By setting keep to False, all duplicates are True:
>>> idx.duplicated(keep=False) array([ True, False, True, False, True])
- final union(other, sort=None)[source]
Form the union of two Index objects.
If the Index objects are incompatible, both Index objects will be cast to dtype(‘object’) first.
- Parameters:
other (Index or array-like) –
sort (bool or None, default None) –
Whether to sort the resulting Index.
None : Sort the result, except when:
self and other are equal;
self or other has length 0;
some values in self or other cannot be compared (a RuntimeWarning is issued in this case).
False : do not sort the result.
True : Sort the result (which may raise TypeError).
- Return type:
Examples
Union matching dtypes
>>> idx1 = pd.Index([1, 2, 3, 4]) >>> idx2 = pd.Index([3, 4, 5, 6]) >>> idx1.union(idx2) Index([1, 2, 3, 4, 5, 6], dtype='int64')
Union mismatched dtypes
>>> idx1 = pd.Index(['a', 'b', 'c', 'd']) >>> idx2 = pd.Index([1, 2, 3, 4]) >>> idx1.union(idx2) Index(['a', 'b', 'c', 'd', 1, 2, 3, 4], dtype='object')
MultiIndex case
>>> idx1 = pd.MultiIndex.from_arrays( ... [[1, 1, 2, 2], ["Red", "Blue", "Red", "Blue"]] ... ) >>> idx1 MultiIndex([(1, 'Red'), (1, 'Blue'), (2, 'Red'), (2, 'Blue')], ) >>> idx2 = pd.MultiIndex.from_arrays( ... [[3, 3, 2, 2], ["Red", "Green", "Red", "Green"]] ... ) >>> idx2 MultiIndex([(3, 'Red'), (3, 'Green'), (2, 'Red'), (2, 'Green')], ) >>> idx1.union(idx2) MultiIndex([(1, 'Blue'), (1, 'Red'), (2, 'Blue'), (2, 'Green'), (2, 'Red'), (3, 'Green'), (3, 'Red')], ) >>> idx1.union(idx2, sort=False) MultiIndex([(1, 'Red'), (1, 'Blue'), (2, 'Red'), (2, 'Blue'), (3, 'Red'), (3, 'Green'), (2, 'Green')], )
- final intersection(other, sort=False)[source]
Form the intersection of two Index objects.
This returns a new Index with elements common to the index and other.
- Parameters:
other (Index or array-like) –
sort (True, False or None, default False) –
Whether to sort the resulting index.
None : sort the result, except when self and other are equal or when the values cannot be compared.
False : do not sort the result.
True : Sort the result (which may raise TypeError).
- Return type:
Examples
>>> idx1 = pd.Index([1, 2, 3, 4]) >>> idx2 = pd.Index([3, 4, 5, 6]) >>> idx1.intersection(idx2) Index([3, 4], dtype='int64')
- final difference(other, sort=None)[source]
Return a new Index with elements of index not in other.
This is the set difference of two Index objects.
- Parameters:
other (Index or array-like) –
sort (bool or None, default None) –
Whether to sort the resulting index. By default, pandas attempts to sort the values, catching any TypeError from comparing incomparable elements.
None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.
False : Do not sort the result.
True : Sort the result (which may raise TypeError).
- Return type:
Examples
>>> idx1 = pd.Index([2, 1, 3, 4]) >>> idx2 = pd.Index([3, 4, 5, 6]) >>> idx1.difference(idx2) Index([1, 2], dtype='int64') >>> idx1.difference(idx2, sort=False) Index([2, 1], dtype='int64')
- symmetric_difference(other, result_name=None, sort=None)[source]
Compute the symmetric difference of two Index objects.
- Parameters:
other (Index or array-like) –
result_name (str) –
sort (bool or None, default None) –
Whether to sort the resulting index. By default, pandas attempts to sort the values, catching any TypeError from comparing incomparable elements.
None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.
False : Do not sort the result.
True : Sort the result (which may raise TypeError).
- Return type:
Notes
symmetric_difference contains elements that appear in either idx1 or idx2 but not both. Equivalent to the Index created by idx1.difference(idx2) | idx2.difference(idx1) with duplicates dropped.
Examples
>>> idx1 = pd.Index([1, 2, 3, 4]) >>> idx2 = pd.Index([2, 3, 4, 5]) >>> idx1.symmetric_difference(idx2) Index([1, 5], dtype='int64')
- get_loc(key)[source]
Get integer location, slice or boolean mask for requested label.
- Parameters:
key (label) –
- Return type:
int if unique index, slice if monotonic index, else mask
Examples
>>> unique_index = pd.Index(list('abc')) >>> unique_index.get_loc('b') 1
>>> monotonic_index = pd.Index(list('abbc')) >>> monotonic_index.get_loc('b') slice(1, 3, None)
>>> non_monotonic_index = pd.Index(list('abcb')) >>> non_monotonic_index.get_loc('b') array([False, True, False, True])
- final get_indexer(target, method=None, limit=None, tolerance=None)[source]
Compute indexer and mask for new index given the current index.
The indexer should then be used as an input to ndarray.take to align the current data to the new index.
- Parameters:
target (Index) –
method ({None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional) –
default: exact matches only.
pad / ffill: find the PREVIOUS index value if no exact match.
backfill / bfill: use NEXT index value if no exact match
nearest: use the NEAREST index value if no exact match. Tied distances are broken by preferring the larger index value.
limit (int, optional) – Maximum number of consecutive labels in target to match for inexact matches.
tolerance (optional) – Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
- Returns:
Integers from 0 to n - 1 indicating that the index at these positions matches the corresponding target values. Missing values in the target are marked by -1.
- Return type:
np.ndarray[np.intp]
Notes
Returns -1 for unmatched values; for further explanation see the example below.
Examples
>>> index = pd.Index(['c', 'a', 'b']) >>> index.get_indexer(['a', 'b', 'x']) array([ 1, 2, -1])
Notice that the return value is an array of locations in index and x is marked by -1, as it is not in index.
- reindex(target, method=None, level=None, limit=None, tolerance=None)[source]
Create index with target’s values.
- Parameters:
target (an iterable) –
method ({None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional) –
default: exact matches only.
pad / ffill: find the PREVIOUS index value if no exact match.
backfill / bfill: use NEXT index value if no exact match
nearest: use the NEAREST index value if no exact match. Tied distances are broken by preferring the larger index value.
level (int, optional) – Level of multiindex.
limit (int, optional) – Maximum number of consecutive labels in target to match for inexact matches.
tolerance (int or float, optional) – Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
- Returns:
new_index (pd.Index) – Resulting index.
indexer (np.ndarray[np.intp] or None) – Indices of output values in original index.
- Raises:
TypeError – If method passed along with level.
ValueError – If non-unique multi-index.
ValueError – If non-unique index and method or limit passed.
- Return type:
See also
Series.reindex – Conform Series to new index with optional filling logic.
DataFrame.reindex – Conform DataFrame to new index with optional filling logic.
Examples
>>> idx = pd.Index(['car', 'bike', 'train', 'tractor']) >>> idx Index(['car', 'bike', 'train', 'tractor'], dtype='object') >>> idx.reindex(['car', 'bike']) (Index(['car', 'bike'], dtype='object'), array([0, 1]))
- property values: ExtensionArray | ndarray
Return an array representing the data in the Index.
Warning
We recommend using Index.array or Index.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.
- Returns:
array
- Return type:
numpy.ndarray or ExtensionArray
See also
Index.array – Reference to the underlying data.
Index.to_numpy – A NumPy array representing the underlying data.
- array
The ExtensionArray of the data backing this Series or Index.
- Returns:
An ExtensionArray of the values stored within. For extension types, this is the actual array. For NumPy native types, this is a thin (no copy) wrapper around numpy.ndarray. .array differs from .values, which may require converting the data to a different form.
- Return type:
ExtensionArray
See also
Index.to_numpy – Similar method that always returns a NumPy array.
Series.to_numpy – Similar method that always returns a NumPy array.
Notes
This table lays out the different array types for each extension dtype within pandas.
dtype
array type
category
Categorical
period
PeriodArray
interval
IntervalArray
IntegerNA
IntegerArray
string
StringArray
boolean
BooleanArray
datetime64[ns, tz]
DatetimeArray
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes .array will be an arrays.NumpyExtensionArray wrapping the actual ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then use Series.to_numpy() instead.
Examples
For regular NumPy types like int and float, a PandasArray is returned.
>>> pd.Series([1, 2, 3]).array <PandasArray> [1, 2, 3] Length: 3, dtype: int64
For extension types, like Categorical, the actual ExtensionArray is returned
>>> ser = pd.Series(pd.Categorical(['a', 'b', 'a'])) >>> ser.array ['a', 'b', 'a'] Categories (2, object): ['a', 'b']
- memory_usage(deep=False)[source]
Memory usage of the values.
- Parameters:
deep (bool, default False) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption.
- Return type:
bytes used
See also
numpy.ndarray.nbytes – Total bytes consumed by the elements of the array.
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False or if used on PyPy.
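Examples
A minimal sketch; exact byte counts vary by platform and pandas version, but deep introspection of an object-dtype index never reports less memory than the shallow count:
>>> idx = pd.Index(['a', 'bb', 'ccc'])
>>> idx.memory_usage(deep=True) >= idx.memory_usage()
True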
- final where(cond, other=None)[source]
Replace values where the condition is False.
The replacement is taken from other.
- Parameters:
cond (bool array-like with the same length as self) – Condition to select the values on.
other (scalar, or array-like, default None) – Replacement if the condition is False.
- Returns:
A copy of self with values replaced from other where the condition is False.
- Return type:
See also
Series.where – Same method for Series.
DataFrame.where – Same method for DataFrame.
Examples
>>> idx = pd.Index(['car', 'bike', 'train', 'tractor']) >>> idx Index(['car', 'bike', 'train', 'tractor'], dtype='object') >>> idx.where(idx.isin(['car', 'train']), 'other') Index(['car', 'other', 'train', 'other'], dtype='object')
- putmask(mask, value)[source]
Return a new Index of the values set with the mask.
- Return type:
See also
numpy.ndarray.putmask – Changes elements of an array based on conditional and input values.
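Examples
A minimal sketch of putmask; the values here are illustrative. Positions where the mask is True are replaced from value:
>>> idx1 = pd.Index([1, 2, 3])
>>> idx2 = pd.Index([5, 6, 7])
>>> idx1.putmask([True, False, False], idx2)
Index([5, 2, 3], dtype='int64')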
- equals(other)[source]
Determine if two Index objects are equal.
The things that are being compared are:
The elements inside the Index object.
The order of the elements inside the Index object.
- Parameters:
other (Any) – The other object to compare against.
- Returns:
True if “other” is an Index and it has the same elements and order as the calling index; False otherwise.
- Return type:
Examples
>>> idx1 = pd.Index([1, 2, 3]) >>> idx1 Index([1, 2, 3], dtype='int64') >>> idx1.equals(pd.Index([1, 2, 3])) True
The elements inside are compared
>>> idx2 = pd.Index(["1", "2", "3"]) >>> idx2 Index(['1', '2', '3'], dtype='object')
>>> idx1.equals(idx2) False
The order is compared
>>> ascending_idx = pd.Index([1, 2, 3]) >>> ascending_idx Index([1, 2, 3], dtype='int64') >>> descending_idx = pd.Index([3, 2, 1]) >>> descending_idx Index([3, 2, 1], dtype='int64') >>> ascending_idx.equals(descending_idx) False
The dtype is not compared
>>> int64_idx = pd.Index([1, 2, 3], dtype='int64') >>> int64_idx Index([1, 2, 3], dtype='int64') >>> uint64_idx = pd.Index([1, 2, 3], dtype='uint64') >>> uint64_idx Index([1, 2, 3], dtype='uint64') >>> int64_idx.equals(uint64_idx) True
- final identical(other)[source]
Similar to equals, but checks that object attributes and types are also equal.
- Returns:
If two Index objects have equal elements and same type True, otherwise False.
- Return type:
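Examples
A minimal sketch; unlike equals, attributes such as the index name are also compared:
>>> idx1 = pd.Index(['1', '2', '3'])
>>> idx2 = pd.Index(['1', '2', '3'])
>>> idx2.identical(idx1)
True
>>> idx_with_name = pd.Index(['1', '2', '3'], name='A')
>>> idx2.identical(idx_with_name)
False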
- final asof(label)[source]
Return the label from the index, or, if not present, the previous one.
Assuming that the index is sorted, return the passed index label if it is in the index, or return the previous index label if the passed one is not in the index.
- Parameters:
label (object) – The label up to which the method returns the latest index label.
- Returns:
The passed label if it is in the index. The previous label if the passed label is not in the sorted index or NaN if there is no such label.
- Return type:
See also
Series.asof – Return the latest value in a Series up to the passed index.
merge_asof – Perform an asof merge (similar to left join but it matches on nearest key rather than equal key).
Index.get_loc – An asof is a thin wrapper around get_loc with method='pad'.
Examples
Index.asof returns the latest index label up to the passed label.
>>> idx = pd.Index(['2013-12-31', '2014-01-02', '2014-01-03']) >>> idx.asof('2014-01-01') '2013-12-31'
If the label is in the index, the method returns the passed label.
>>> idx.asof('2014-01-02') '2014-01-02'
If all of the labels in the index are later than the passed label, NaN is returned.
>>> idx.asof('1999-01-02') nan
If the index is not sorted, an error is raised.
>>> idx_not_sorted = pd.Index(['2013-12-31', '2015-01-02', ... '2014-01-03']) >>> idx_not_sorted.asof('2013-12-31') Traceback (most recent call last): ValueError: index must be monotonic increasing or decreasing
- asof_locs(where, mask)[source]
Return the locations (indices) of labels in the index.
As in the asof function, if the label (a particular entry in where) is not in the index, the latest index label up to the passed label is chosen and its index returned.
If all of the labels in the index are later than a label in where, -1 is returned.
mask is used to ignore NA values in the index during calculation.
- Parameters:
- Returns:
An array of locations (indices) of the labels from the Index which correspond to the return values of the asof function for every element in where.
- Return type:
np.ndarray[np.intp]
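Examples
A minimal sketch, assuming a sorted integer index and an all-True mask (no NA values to ignore):
>>> import numpy as np
>>> idx = pd.Index([10, 20, 30, 40])
>>> where = pd.Index([15, 35])
>>> mask = np.ones(4, dtype=bool)
>>> idx.asof_locs(where, mask)
array([0, 2])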
- sort_values(return_indexer=False, ascending=True, na_position='last', key=None)[source]
Return a sorted copy of the index.
Return a sorted copy of the index, and optionally return the indices that sorted the index itself.
- Parameters:
return_indexer (bool, default False) – Should the indices that would sort the index be returned.
ascending (bool, default True) – Should the index values be sorted in an ascending order.
na_position ({'first' or 'last'}, default 'last') –
Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
New in version 1.2.0.
key (callable, optional) –
If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape.
New in version 1.1.0.
- Returns:
sorted_index (pandas.Index) – Sorted copy of the index.
indexer (numpy.ndarray, optional) – The indices that the index itself was sorted by.
See also
Series.sort_values – Sort values of a Series.
DataFrame.sort_values – Sort values in a DataFrame.
Examples
>>> idx = pd.Index([10, 100, 1, 1000]) >>> idx Index([10, 100, 1, 1000], dtype='int64')
Sort values in ascending order (default behavior).
>>> idx.sort_values() Index([1, 10, 100, 1000], dtype='int64')
Sort values in descending order, and also get the indices idx was sorted by.
>>> idx.sort_values(ascending=False, return_indexer=True) (Index([1000, 100, 10, 1], dtype='int64'), array([3, 1, 0, 2]))
- shift(periods=1, freq=None)[source]
Shift index by desired number of time frequency increments.
This method is for shifting the values of datetime-like indexes by a specified time increment a given number of times.
- Parameters:
periods (int, default 1) – Number of periods (or increments) to shift by, can be positive or negative.
freq (pandas.DateOffset, pandas.Timedelta or str, optional) – Frequency increment to shift by. If None, the index is shifted by its own freq attribute. Offset aliases are valid strings, e.g., ‘D’, ‘W’, ‘M’ etc.
- Returns:
Shifted index.
- Return type:
See also
Series.shift – Shift values of Series.
Notes
This method is only implemented for datetime-like index classes, i.e., DatetimeIndex, PeriodIndex and TimedeltaIndex.
Examples
Put the first 5 month starts of 2011 into an index.
>>> month_starts = pd.date_range('1/1/2011', periods=5, freq='MS') >>> month_starts DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01', '2011-05-01'], dtype='datetime64[ns]', freq='MS')
Shift the index by 10 days.
>>> month_starts.shift(10, freq='D') DatetimeIndex(['2011-01-11', '2011-02-11', '2011-03-11', '2011-04-11', '2011-05-11'], dtype='datetime64[ns]', freq=None)
The default value of freq is the freq attribute of the index, which is ‘MS’ (month start) in this example.
>>> month_starts.shift(10) DatetimeIndex(['2011-11-01', '2011-12-01', '2012-01-01', '2012-02-01', '2012-03-01'], dtype='datetime64[ns]', freq='MS')
- argsort(*args, **kwargs)[source]
Return the integer indices that would sort the index.
- Parameters:
*args – Passed to numpy.ndarray.argsort.
**kwargs – Passed to numpy.ndarray.argsort.
- Returns:
Integer indices that would sort the index if used as an indexer.
- Return type:
np.ndarray[np.intp]
See also
numpy.argsort – Similar method for NumPy arrays.
Index.sort_values – Return sorted copy of Index.
Examples
>>> idx = pd.Index(['b', 'a', 'd', 'c']) >>> idx Index(['b', 'a', 'd', 'c'], dtype='object')
>>> order = idx.argsort() >>> order array([1, 0, 3, 2])
>>> idx[order] Index(['a', 'b', 'c', 'd'], dtype='object')
- get_indexer_non_unique(target)[source]
Compute indexer and mask for new index given the current index.
The indexer should then be used as an input to ndarray.take to align the current data to the new index.
- Parameters:
target (Index) –
- Returns:
indexer (np.ndarray[np.intp]) – Integers from 0 to n - 1 indicating that the index at these positions matches the corresponding target values. Missing values in the target are marked by -1.
missing (np.ndarray[np.intp]) – An indexer into the target of the values not found. These correspond to the -1 in the indexer array.
- Return type:
tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]
Examples
>>> index = pd.Index(['c', 'b', 'a', 'b', 'b']) >>> index.get_indexer_non_unique(['b', 'b']) (array([1, 3, 4, 1, 3, 4]), array([], dtype=int64))
In the example below there are no matched values.
>>> index = pd.Index(['c', 'b', 'a', 'b', 'b']) >>> index.get_indexer_non_unique(['q', 'r', 't']) (array([-1, -1, -1]), array([0, 1, 2]))
For this reason, the returned indexer contains only integers equal to -1. It demonstrates that there's no match between the index and the target values at these positions. The mask [0, 1, 2] in the return value shows that the first, second, and third elements are missing.
Notice that the return value is a tuple containing two items. In the example below, the first item is an array of locations in index. The second item is a mask showing that the first and third elements are missing.
>>> index = pd.Index(['c', 'b', 'a', 'b', 'b']) >>> index.get_indexer_non_unique(['f', 'b', 's']) (array([-1, 1, 3, 4, -1]), array([0, 2]))
- final get_indexer_for(target)[source]
Guaranteed return of an indexer even when non-unique.
This dispatches to get_indexer or get_indexer_non_unique as appropriate.
- Returns:
List of indices.
- Return type:
np.ndarray[np.intp]
Examples
>>> idx = pd.Index([np.nan, 'var1', np.nan]) >>> idx.get_indexer_for([np.nan]) array([0, 2])
- final groupby(values)[source]
Group the index labels by a given array of values.
- Parameters:
values (array) – Values used to determine the groups.
- Returns:
{group name -> group labels}
- Return type:
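Examples
A minimal sketch; the grouping array here is illustrative. Each distinct grouping value maps to the index labels at its positions:
>>> idx = pd.Index(['a', 'b', 'c', 'd'])
>>> idx.groupby(['g1', 'g2', 'g1', 'g2'])
{'g1': Index(['a', 'c'], dtype='object'), 'g2': Index(['b', 'd'], dtype='object')}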
- map(mapper, na_action=None)[source]
Map values using an input mapping or function.
- Parameters:
- Returns:
The output of the mapping function applied to the index. If the function returns a tuple with more than one element a MultiIndex will be returned.
- Return type:
Union[Index, MultiIndex]
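Examples
A minimal sketch using a dict as the mapper:
>>> idx = pd.Index([1, 2, 3])
>>> idx.map({1: 'a', 2: 'b', 3: 'c'})
Index(['a', 'b', 'c'], dtype='object')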
- isin(values, level=None)[source]
Return a boolean array where the index values are in values.
Compute boolean array of whether each index value is found in the passed set of values. The length of the returned boolean array matches the length of the index.
- Parameters:
- Returns:
NumPy array of boolean values.
- Return type:
np.ndarray[bool]
See also
Series.isin – Same for Series.
DataFrame.isin – Same method for DataFrames.
Notes
In the case of MultiIndex you must either specify values as a list-like object containing tuples that are the same length as the number of levels, or specify level. Otherwise it will raise a ValueError.
If level is specified:
if it is the name of one and only one index level, use that level;
otherwise it should be a number indicating level position.
Examples
>>> idx = pd.Index([1,2,3]) >>> idx Index([1, 2, 3], dtype='int64')
Check whether each index value is in a list of values.
>>> idx.isin([1, 4]) array([ True, False, False])
>>> midx = pd.MultiIndex.from_arrays([[1,2,3], ... ['red', 'blue', 'green']], ... names=('number', 'color')) >>> midx MultiIndex([(1, 'red'), (2, 'blue'), (3, 'green')], names=['number', 'color'])
Check whether the strings in the ‘color’ level of the MultiIndex are in a list of colors.
>>> midx.isin(['red', 'orange', 'yellow'], level='color') array([ True, False, False])
To check across the levels of a MultiIndex, pass a list of tuples:
>>> midx.isin([(1, 'red'), (3, 'red')]) array([ True, False, False])
For a DatetimeIndex, string values in values are converted to Timestamps.
>>> dates = ['2000-03-11', '2000-03-12', '2000-03-13'] >>> dti = pd.to_datetime(dates) >>> dti DatetimeIndex(['2000-03-11', '2000-03-12', '2000-03-13'], dtype='datetime64[ns]', freq=None)
>>> dti.isin(['2000-03-11']) array([ True, False, False])
- slice_indexer(start=None, end=None, step=None)[source]
Compute the slice indexer for input labels and step.
Index needs to be ordered and unique.
- Parameters:
start (label, default None) – If None, defaults to the beginning.
end (label, default None) – If None, defaults to the end.
step (int, default None) –
- Return type:
- Raises:
KeyError – If key does not exist, or key is not unique and index is not ordered.
Notes
This function assumes that the data is sorted, so use at your own peril.
Examples
This is a method on all index types. For example you can do:
>>> idx = pd.Index(list('abcd')) >>> idx.slice_indexer(start='b', end='c') slice(1, 3, None)
>>> idx = pd.MultiIndex.from_arrays([list('abcd'), list('efgh')]) >>> idx.slice_indexer(start='b', end=('c', 'g')) slice(1, 3, None)
- get_slice_bound(label, side)[source]
Calculate slice bound that corresponds to given label.
Returns leftmost (one-past-the-rightmost if side=='right') position of given label.
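Examples
A minimal sketch on a sorted index with a repeated label:
>>> idx = pd.Index(['a', 'b', 'b', 'c'])
>>> idx.get_slice_bound('b', side='left')
1
>>> idx.get_slice_bound('b', side='right')
3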
- slice_locs(start=None, end=None, step=None)[source]
Compute slice locations for input labels.
- Parameters:
start (label, default None) – If None, defaults to the beginning.
end (label, default None) – If None, defaults to the end.
step (int, default None) – If None, defaults to 1.
- Return type:
See also
Index.get_loc – Get location for a single label.
Notes
This method only works if the index is monotonic or unique.
Examples
>>> idx = pd.Index(list('abcd')) >>> idx.slice_locs(start='b', end='c') (1, 3)
- delete(loc)[source]
Make new Index with passed location(-s) deleted.
- Parameters:
- Returns:
Will be same type as self, except for RangeIndex.
- Return type:
See also
numpy.delete – Delete rows and columns from a NumPy array (ndarray).
Examples
>>> idx = pd.Index(['a', 'b', 'c']) >>> idx.delete(1) Index(['a', 'c'], dtype='object')
>>> idx = pd.Index(['a', 'b', 'c']) >>> idx.delete([0, 2]) Index(['b'], dtype='object')
- insert(loc, item)[source]
Make new Index inserting new item at location.
Follows Python numpy.insert semantics for negative values.
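Examples
A minimal sketch:
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.insert(1, 'x')
Index(['a', 'x', 'b', 'c'], dtype='object')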
- drop(labels, errors='raise')[source]
Make new Index with passed list of labels deleted.
- Parameters:
labels (array-like or scalar) –
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and existing labels are dropped.
- Returns:
Will be same type as self, except for RangeIndex.
- Return type:
- Raises:
KeyError – If not all of the labels are found in the selected axis.
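Examples
A minimal sketch:
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.drop(['a'])
Index(['b', 'c'], dtype='object')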
- any(*args, **kwargs)[source]
Return whether any element is Truthy.
- Parameters:
*args – Required for compatibility with numpy.
**kwargs – Required for compatibility with numpy.
- Returns:
A single element array-like may be converted to bool.
- Return type:
bool or array-like (if axis is specified)
See also
Index.all – Return whether all elements are True.
Series.all – Return whether all elements are True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
>>> index = pd.Index([0, 1, 2]) >>> index.any() True
>>> index = pd.Index([0, 0, 0]) >>> index.any() False
- all(*args, **kwargs)[source]
Return whether all elements are Truthy.
- Parameters:
*args – Required for compatibility with numpy.
**kwargs – Required for compatibility with numpy.
- Returns:
A single element array-like may be converted to bool.
- Return type:
bool or array-like (if axis is specified)
See also
Index.any – Return whether any element in an Index is True.
Series.any – Return whether any element in a Series is True.
Series.all – Return whether all elements in a Series are True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
True, because nonzero integers are considered True.
>>> pd.Index([1, 2, 3]).all() True
False, because 0 is considered False.
>>> pd.Index([0, 1, 2]).all() False
- argmin(axis=None, skipna=True, *args, **kwargs)[source]
Return int position of the smallest value in the Series.
If the minimum is achieved in multiple locations, the first row position is returned.
- Parameters:
axis ({None}) – Unused. Parameter needed for compatibility with DataFrame.
skipna (bool, default True) – Exclude NA/null values when showing the result.
*args – Additional arguments and keywords for compatibility with NumPy.
**kwargs – Additional arguments and keywords for compatibility with NumPy.
- Returns:
Row position of the minimum value.
- Return type:
See also
Series.argmin – Return position of the minimum value.
Series.argmax – Return position of the maximum value.
numpy.ndarray.argmin – Equivalent method for numpy arrays.
Series.idxmax – Return index label of the maximum values.
Series.idxmin – Return index label of the minimum values.
Examples
Consider a dataset containing cereal calories.
>>> s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0, ... 'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0}) >>> s Corn Flakes 100.0 Almond Delight 110.0 Cinnamon Toast Crunch 120.0 Cocoa Puff 110.0 dtype: float64
>>> s.argmax() 2 >>> s.argmin() 0
The maximum cereal calories is the third element and the minimum cereal calories is the first element, since the series is zero-indexed.
- argmax(axis=None, skipna=True, *args, **kwargs)[source]
Return int position of the largest value in the Series.
If the maximum is achieved in multiple locations, the first row position is returned.
- Parameters:
axis ({None}) – Unused. Parameter needed for compatibility with DataFrame.
skipna (bool, default True) – Exclude NA/null values when showing the result.
*args – Additional arguments and keywords for compatibility with NumPy.
**kwargs – Additional arguments and keywords for compatibility with NumPy.
- Returns:
Row position of the maximum value.
- Return type:
See also
Series.argmax – Return position of the maximum value.
Series.argmin – Return position of the minimum value.
numpy.ndarray.argmax – Equivalent method for numpy arrays.
Series.idxmax – Return index label of the maximum values.
Series.idxmin – Return index label of the minimum values.
Examples
Consider a dataset containing cereal calories.
>>> s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0, ... 'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0}) >>> s Corn Flakes 100.0 Almond Delight 110.0 Cinnamon Toast Crunch 120.0 Cocoa Puff 110.0 dtype: float64
>>> s.argmax() 2 >>> s.argmin() 0
The maximum cereal calories is the third element and the minimum cereal calories is the first element, since the series is zero-indexed.
- min(axis=None, skipna=True, *args, **kwargs)[source]
Return the minimum value of the Index.
- Parameters:
axis ({None}) – Dummy argument for consistency with Series.
skipna (bool, default True) – Exclude NA/null values when showing the result.
*args – Additional arguments and keywords for compatibility with NumPy.
**kwargs – Additional arguments and keywords for compatibility with NumPy.
- Returns:
Minimum value.
- Return type:
scalar
See also
Index.max – Return the maximum value of the object.
Series.min – Return the minimum value in a Series.
DataFrame.min – Return the minimum values in a DataFrame.
Examples
>>> idx = pd.Index([3, 2, 1]) >>> idx.min() 1
>>> idx = pd.Index(['c', 'b', 'a']) >>> idx.min() 'a'
For a MultiIndex, the minimum is determined lexicographically.
>>> idx = pd.MultiIndex.from_product([('a', 'b'), (2, 1)]) >>> idx.min() ('a', 1)
- max(axis=None, skipna=True, *args, **kwargs)[source]
Return the maximum value of the Index.
- Parameters:
axis (int, optional) – For compatibility with NumPy. Only 0 or None are allowed.
skipna (bool, default True) – Exclude NA/null values when showing the result.
*args – Additional arguments and keywords for compatibility with NumPy.
**kwargs – Additional arguments and keywords for compatibility with NumPy.
- Returns:
Maximum value.
- Return type:
scalar
See also
Index.min – Return the minimum value in an Index.
Series.max – Return the maximum value in a Series.
DataFrame.max – Return the maximum values in a DataFrame.
Examples
>>> idx = pd.Index([3, 2, 1]) >>> idx.max() 3
>>> idx = pd.Index(['c', 'b', 'a']) >>> idx.max() 'c'
For a MultiIndex, the maximum is determined lexicographically.
>>> idx = pd.MultiIndex.from_product([('a', 'b'), (2, 1)]) >>> idx.max() ('b', 2)
- class pandas.Int16Dtype[source]
An ExtensionDtype for int16 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- None
- None()
- type
alias of int16
- class pandas.Int32Dtype[source]
An ExtensionDtype for int32 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- None
- None()
- type
alias of int32
- class pandas.Int64Dtype[source]
An ExtensionDtype for int64 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- None
- None()
- type
alias of int64
- class pandas.Int8Dtype[source]
An ExtensionDtype for int8 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- None
- None()
- type
alias of int8
- class pandas.Interval
Immutable object implementing an Interval, a bounded slice-like interval.
- Parameters:
left (orderable scalar) – Left bound for the interval.
right (orderable scalar) – Right bound for the interval.
closed ({'right', 'left', 'both', 'neither'}, default 'right') – Whether the interval is closed on the left-side, right-side, both or neither. See the Notes for more detailed explanation.
See also
IntervalIndexAn Index of Interval objects that are all closed on the same side.
cutConvert continuous data into discrete bins (Categorical of Interval objects).
qcutConvert continuous data into bins (Categorical of Interval objects) based on quantiles.
PeriodRepresents a period of time.
Notes
The parameters left and right must be of the same type; you must be able to compare them and they must satisfy left <= right.
A closed interval (in mathematics denoted by square brackets) contains its endpoints, i.e. the closed interval [0, 5] is characterized by the conditions 0 <= x <= 5. This is what closed='both' stands for. An open interval (in mathematics denoted by parentheses) does not contain its endpoints, i.e. the open interval (0, 5) is characterized by the conditions 0 < x < 5. This is what closed='neither' stands for. Intervals can also be half-open or half-closed, i.e. [0, 5) is described by 0 <= x < 5 (closed='left') and (0, 5] is described by 0 < x <= 5 (closed='right').
It is possible to build Intervals of different types, like numeric ones:
>>> iv = pd.Interval(left=0, right=5) >>> iv Interval(0, 5, closed='right')
You can check if an element belongs to it, or if it contains another interval:
>>> 2.5 in iv True >>> pd.Interval(left=2, right=5, closed='both') in iv True
You can test the bounds (closed='right', so 0 < x <= 5):
>>> 0 in iv False >>> 5 in iv True >>> 0.0001 in iv True
Calculate its length
>>> iv.length 5
You can operate with + and * over an Interval and the operation is applied to each of its bounds, so the result depends on the type of the bound elements
>>> shifted_iv = iv + 3 >>> shifted_iv Interval(3, 8, closed='right') >>> extended_iv = iv * 10.0 >>> extended_iv Interval(0.0, 50.0, closed='right')
To create a time interval you can use Timestamps as the bounds
>>> year_2017 = pd.Interval(pd.Timestamp('2017-01-01 00:00:00'), ... pd.Timestamp('2018-01-01 00:00:00'), ... closed='left') >>> pd.Timestamp('2017-01-01 00:00') in year_2017 True >>> year_2017.length Timedelta('365 days 00:00:00')
- closed
String describing the inclusive side of the intervals.
Either left, right, both or neither.
- left
Left bound for the interval.
- overlaps()
Check whether two Interval objects overlap.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.
- Parameters:
other (Interval) – Interval to check against for an overlap.
- Returns:
True if the two intervals overlap.
- Return type:
See also
IntervalArray.overlaps – The corresponding method for IntervalArray.
IntervalIndex.overlaps – The corresponding method for IntervalIndex.
Examples
>>> i1 = pd.Interval(0, 2) >>> i2 = pd.Interval(1, 3) >>> i1.overlaps(i2) True >>> i3 = pd.Interval(4, 5) >>> i1.overlaps(i3) False
Intervals that share closed endpoints overlap:
>>> i4 = pd.Interval(0, 1, closed='both') >>> i5 = pd.Interval(1, 2, closed='both') >>> i4.overlaps(i5) True
Intervals that only have an open endpoint in common do not overlap:
>>> i6 = pd.Interval(1, 2, closed='neither') >>> i4.overlaps(i6) False
- right
Right bound for the interval.
- class pandas.IntervalDtype[source]
An ExtensionDtype for Interval data.
This is not an actual numpy dtype, but a duck type.
- Parameters:
subtype (str, np.dtype) – The dtype of the Interval bounds.
closed (str_type | None) –
- subtype
- None()
Examples
>>> pd.IntervalDtype(subtype='int64', closed='both') interval[int64, both]
- name = 'interval'
- num = 103
- property closed
- property subtype
The dtype of the Interval bounds.
- classmethod construct_array_type()[source]
Return the array type associated with this dtype.
- Return type:
- classmethod construct_from_string(string)[source]
Attempt to construct this type from a string; raise a TypeError if it is not possible.
- Parameters:
string (str) –
- Return type:
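Examples
A minimal sketch; the accepted string mirrors the dtype's repr:
>>> pd.IntervalDtype.construct_from_string('interval[int64, both]')
interval[int64, both]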
- property type: type[pandas._libs.interval.Interval]
The scalar type for the array, e.g. int
It's expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class pandas.IntervalIndex[source]
Immutable index of intervals that are closed on the same side.
New in version 0.20.0.
- Parameters:
data (array-like (1-dimensional)) – Array-like (ndarray, DateTimeArray, TimeDeltaArray) containing Interval objects from which to build the IntervalIndex.
closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.
dtype (dtype or None, default None) – If None, dtype will be inferred.
copy (bool, default False) – Copy the input data.
name (object, optional) – Name to be stored in the index.
verify_integrity (bool, default True) – Verify that the IntervalIndex is valid.
- Return type:
- left
- right
- closed
- Type:
IntervalClosedType
- mid
- length
- is_empty
- is_overlapping
- values
- from_arrays()[source]
- Parameters:
closed (IntervalClosedType) –
name (Hashable) –
copy (bool) –
dtype (Dtype | None) –
- Return type:
- from_tuples()[source]
- Parameters:
closed (IntervalClosedType) –
name (Hashable) –
copy (bool) –
dtype (Dtype | None) –
- Return type:
- from_breaks()[source]
- Parameters:
closed (IntervalClosedType | None) –
name (Hashable) –
copy (bool) –
dtype (Dtype | None) –
- Return type:
- contains()
- overlaps()
- set_closed()
- to_tuples()
See also
IndexThe base pandas Index type.
IntervalA bounded slice-like interval; the elements of an IntervalIndex.
interval_rangeFunction to create a fixed frequency IntervalIndex.
cutBin values into discrete Intervals.
qcutBin values into equal-sized Intervals based on rank or sample quantiles.
Notes
See the user guide for more.
Examples
A new IntervalIndex is typically constructed using interval_range():
>>> pd.interval_range(start=0, end=5) IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
It may also be constructed using one of the constructor methods: IntervalIndex.from_arrays(), IntervalIndex.from_breaks(), and IntervalIndex.from_tuples().
See further examples in the doc strings of interval_range and the mentioned constructor methods.
- closed: IntervalClosedType
String describing the inclusive side of the intervals.
Either left, right, both or neither.
- is_non_overlapping_monotonic: bool
Return a boolean whether the IntervalArray is non-overlapping and monotonic.
Non-overlapping means that no Intervals share points, and monotonic means either monotonic increasing or monotonic decreasing.
- property closed_left
Check if the interval is closed on the left side.
For the meaning of closed and open see Interval.
- Returns:
True if the Interval is closed on the left-side.
- Return type:
See also
Interval.closed_right – Check if the interval is closed on the right side.
Interval.open_left – Boolean inverse of closed_left.
Examples
>>> iv = pd.Interval(0, 5, closed='left') >>> iv.closed_left True
>>> iv = pd.Interval(0, 5, closed='right') >>> iv.closed_left False
- property closed_right
Check if the interval is closed on the right side.
For the meaning of closed and open see Interval.
- Returns:
True if the Interval is closed on the right-side.
- Return type:
See also
Interval.closed_left – Check if the interval is closed on the left side.
Interval.open_right – Boolean inverse of closed_right.
Examples
>>> iv = pd.Interval(0, 5, closed='both') >>> iv.closed_right True
>>> iv = pd.Interval(0, 5, closed='left') >>> iv.closed_right False
- property open_left
Check if the interval is open on the left side.
For the meaning of closed and open see Interval.
- Returns:
True if the Interval is not closed on the left-side.
- Return type:
See also
Interval.open_right – Check if the interval is open on the right side.
Interval.closed_left – Boolean inverse of open_left.
Examples
>>> iv = pd.Interval(0, 5, closed='neither') >>> iv.open_left True
>>> iv = pd.Interval(0, 5, closed='both') >>> iv.open_left False
- property open_right
Check if the interval is open on the right side.
For the meaning of closed and open see Interval.
- Returns:
True if the Interval is not closed on the right-side.
- Return type:
See also
Interval.open_left – Check if the interval is open on the left side.
Interval.closed_right – Boolean inverse of open_right.
Examples
>>> iv = pd.Interval(0, 5, closed='left') >>> iv.open_right True
>>> iv = pd.Interval(0, 5) >>> iv.open_right False
- classmethod from_breaks(breaks, closed='right', name=None, copy=False, dtype=None)[source]
Construct an IntervalIndex from an array of splits.
- Parameters:
breaks (array-like (1-dimensional)) – Left and right bounds for each interval.
closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.
name (str, optional) – Name of the resulting IntervalIndex.
copy (bool, default False) – Copy the data.
dtype (dtype or None, default None) – If None, dtype will be inferred.
- Return type:
See also
interval_range – Function to create a fixed frequency IntervalIndex.
IntervalIndex.from_arrays – Construct from a left and right array.
IntervalIndex.from_tuples – Construct from a sequence of tuples.
Examples
>>> pd.IntervalIndex.from_breaks([0, 1, 2, 3]) IntervalIndex([(0, 1], (1, 2], (2, 3]], dtype='interval[int64, right]')
- classmethod from_arrays(left, right, closed='right', name=None, copy=False, dtype=None)[source]
Construct from two arrays defining the left and right bounds.
- Parameters:
left (array-like (1-dimensional)) – Left bounds for each interval.
right (array-like (1-dimensional)) – Right bounds for each interval.
closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.
name (str, optional) – Name of the resulting IntervalIndex.
copy (bool, default False) – Copy the data.
dtype (dtype, optional) – If None, dtype will be inferred.
- Return type:
- Raises:
ValueError – When a value is missing in only one of left or right. When a value in left is greater than the corresponding value in right.
See also
interval_range – Function to create a fixed frequency IntervalIndex.
IntervalIndex.from_breaks – Construct an IntervalIndex from an array of splits.
IntervalIndex.from_tuples – Construct an IntervalIndex from an array-like of tuples.
Notes
Each element of left must be less than or equal to the right element at the same position. If an element is missing, it must be missing in both left and right. A TypeError is raised when using an unsupported type for left or right. At the moment, ‘category’, ‘object’, and ‘string’ subtypes are not supported.
Examples
>>> pd.IntervalIndex.from_arrays([0, 1, 2], [1, 2, 3]) IntervalIndex([(0, 1], (1, 2], (2, 3]], dtype='interval[int64, right]')
- classmethod from_tuples(data, closed='right', name=None, copy=False, dtype=None)[source]
Construct an IntervalIndex from an array-like of tuples.
- Parameters:
data (array-like (1-dimensional)) – Array of tuples.
closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.
name (str, optional) – Name of the resulting IntervalIndex.
copy (bool, default False) – By default, copy the data; this is for compatibility only and is ignored.
dtype (dtype or None, default None) – If None, dtype will be inferred.
- Return type:
See also
interval_range – Function to create a fixed frequency IntervalIndex.
IntervalIndex.from_arrays – Construct an IntervalIndex from a left and right array.
IntervalIndex.from_breaks – Construct an IntervalIndex from an array of splits.
Examples
>>> pd.IntervalIndex.from_tuples([(0, 1), (1, 2)]) IntervalIndex([(0, 1], (1, 2]], dtype='interval[int64, right]')
- memory_usage(deep=False)[source]
Memory usage of the values.
- Parameters:
deep (bool, default False) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption.
- Return type:
bytes used
See also
numpy.ndarray.nbytes – Total bytes consumed by the elements of the array.
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False or if used on PyPy.
- is_monotonic_decreasing
Return True if the IntervalIndex is monotonic decreasing (only equal or decreasing values), else False.
- is_unique
Return True if the IntervalIndex contains unique elements, else False.
- property is_overlapping: bool
Return True if the IntervalIndex has overlapping intervals, else False.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.
- Returns:
Boolean indicating if the IntervalIndex has overlapping intervals.
- Return type:
See also
Interval.overlaps – Check whether two Interval objects overlap.
IntervalIndex.overlaps – Check an IntervalIndex elementwise for overlaps.
Examples
>>> index = pd.IntervalIndex.from_tuples([(0, 2), (1, 3), (4, 5)]) >>> index IntervalIndex([(0, 2], (1, 3], (4, 5]], dtype='interval[int64, right]') >>> index.is_overlapping True
Intervals that share closed endpoints overlap:
>>> index = pd.interval_range(0, 3, closed='both') >>> index IntervalIndex([[0, 1], [1, 2], [2, 3]], dtype='interval[int64, both]') >>> index.is_overlapping True
Intervals that only have an open endpoint in common do not overlap:
>>> index = pd.interval_range(0, 3, closed='left') >>> index IntervalIndex([[0, 1), [1, 2), [2, 3)], dtype='interval[int64, left]') >>> index.is_overlapping False
- get_loc(key)[source]
Get integer location, slice or boolean mask for requested label.
- Parameters:
key (label) –
- Return type:
int if unique index, slice if monotonic index, else mask
Examples
>>> i1, i2 = pd.Interval(0, 1), pd.Interval(1, 2) >>> index = pd.IntervalIndex([i1, i2]) >>> index.get_loc(1) 0
You can also supply a point inside an interval.
>>> index.get_loc(1.5) 1
If a label is in several intervals, you get the locations of all the relevant intervals.
>>> i3 = pd.Interval(0, 2) >>> overlapping_index = pd.IntervalIndex([i1, i2, i3]) >>> overlapping_index.get_loc(0.5) array([ True, False, True])
Only exact matches will be returned if an interval is provided.
>>> index.get_loc(pd.Interval(0, 1)) 0
- get_indexer_non_unique(target)[source]
Compute indexer and mask for new index given the current index.
The indexer should then be used as an input to ndarray.take to align the current data to the new index.
- Parameters:
target (IntervalIndex or list of Intervals) –
- Returns:
indexer (np.ndarray[np.intp]) – Integers from 0 to n - 1 indicating that the index at these positions matches the corresponding target values. Missing values in the target are marked by -1.
missing (np.ndarray[np.intp]) – An indexer into the target of the values not found. These correspond to the -1 in the indexer array.
- Return type:
tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]
Examples
>>> index = pd.Index(['c', 'b', 'a', 'b', 'b']) >>> index.get_indexer_non_unique(['b', 'b']) (array([1, 3, 4, 1, 3, 4]), array([], dtype=int64))
In the example below there are no matched values.
>>> index = pd.Index(['c', 'b', 'a', 'b', 'b']) >>> index.get_indexer_non_unique(['q', 'r', 't']) (array([-1, -1, -1]), array([0, 1, 2]))
For this reason, the returned indexer contains only integers equal to -1. It demonstrates that there's no match between the index and the target values at these positions. The mask [0, 1, 2] in the return value shows that the first, second, and third elements are missing.
Notice that the return value is a tuple containing two items. In the example below, the first item is an array of locations in index. The second item is a mask showing that the first and third elements are missing.
>>> index = pd.Index(['c', 'b', 'a', 'b', 'b']) >>> index.get_indexer_non_unique(['f', 'b', 's']) (array([-1, 1, 3, 4, -1]), array([0, 2]))
- left
- right
- mid
- contains(*args, **kwargs)
Check elementwise if the Intervals contain the value.
Return a boolean mask whether the value is contained in the Intervals of the IntervalArray.
- Parameters:
other (scalar) – The value to check whether it is contained in the Intervals.
- Return type:
boolean array
See also
Interval.contains – Check whether Interval object contains value.
IntervalArray.overlaps – Check if an Interval overlaps the values in the IntervalArray.
Examples
>>> intervals = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4)]) >>> intervals <IntervalArray> [(0, 1], (1, 3], (2, 4]] Length: 3, dtype: interval[int64, right]
>>> intervals.contains(0.5) array([ True, False, False])
- property is_empty
Indicates if an interval is empty, meaning it contains no points.
- Returns:
A boolean indicating if a scalar Interval is empty, or a boolean ndarray positionally indicating if an Interval in an IntervalArray or IntervalIndex is empty.
- Return type:
bool or ndarray
See also
Interval.length – Return the length of the Interval.
Examples
An Interval that contains points is not empty:
>>> pd.Interval(0, 1, closed='right').is_empty False
An Interval that does not contain any points is empty:
>>> pd.Interval(0, 0, closed='right').is_empty True >>> pd.Interval(0, 0, closed='left').is_empty True >>> pd.Interval(0, 0, closed='neither').is_empty True
An Interval that contains a single point is not empty:
>>> pd.Interval(0, 0, closed='both').is_empty False
An IntervalArray or IntervalIndex returns a boolean ndarray positionally indicating if an Interval is empty:
>>> ivs = [pd.Interval(0, 0, closed='neither'), ... pd.Interval(1, 2, closed='neither')] >>> pd.arrays.IntervalArray(ivs).is_empty array([ True, False])
Missing values are not considered empty:
>>> ivs = [pd.Interval(0, 0, closed='neither'), np.nan] >>> pd.IntervalIndex(ivs).is_empty array([ True, False])
- overlaps(*args, **kwargs)
Check elementwise if an Interval overlaps the values in the IntervalArray.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.
- Parameters:
other (IntervalArray) – Interval to check against for an overlap.
- Returns:
Boolean array positionally indicating where an overlap occurs.
- Return type:
ndarray
See also
Interval.overlaps – Check whether two Interval objects overlap.
Examples
>>> data = [(0, 1), (1, 3), (2, 4)] >>> intervals = pd.arrays.IntervalArray.from_tuples(data) >>> intervals <IntervalArray> [(0, 1], (1, 3], (2, 4]] Length: 3, dtype: interval[int64, right]
>>> intervals.overlaps(pd.Interval(0.5, 1.5)) array([ True, True, False])
Intervals that share closed endpoints overlap:
>>> intervals.overlaps(pd.Interval(1, 3, closed='left')) array([ True, True, True])
Intervals that only have an open endpoint in common do not overlap:
>>> intervals.overlaps(pd.Interval(1, 2, closed='right')) array([False, True, False])
- set_closed(*args, **kwargs)
Return an identical IntervalArray closed on the specified side.
- Parameters:
closed ({'left', 'right', 'both', 'neither'}) – Whether the intervals are closed on the left-side, right-side, both or neither.
- Return type:
IntervalArray
Examples
>>> index = pd.arrays.IntervalArray.from_breaks(range(4)) >>> index <IntervalArray> [(0, 1], (1, 2], (2, 3]] Length: 3, dtype: interval[int64, right] >>> index.set_closed('both') <IntervalArray> [[0, 1], [1, 2], [2, 3]] Length: 3, dtype: interval[int64, both]
- class pandas.MultiIndex[source]
A multi-level, or hierarchical, index object for pandas objects.
- Parameters:
levels (sequence of arrays) – The unique labels for each level.
codes (sequence of arrays) – Integers for each level designating which label at each location.
sortorder (optional int) – Level of sortedness (must be lexicographically sorted by that level).
names (optional sequence of objects) – Names for each of the index levels. (name is accepted for compat).
copy (bool, default False) – Copy the meta-data.
verify_integrity (bool, default True) – Check that the levels/codes are consistent and valid.
- Return type:
- names
- levels
- codes
- nlevels
- levshape
- dtypes
- from_arrays()[source]
- droplevel()
- get_indexer()
See also
MultiIndex.from_arrays – Convert list of arrays to MultiIndex.
MultiIndex.from_product – Create a MultiIndex from the cartesian product of iterables.
MultiIndex.from_tuples – Convert list of tuples to a MultiIndex.
MultiIndex.from_frame – Make a MultiIndex from a DataFrame.
Index – The base pandas Index type.
Notes
See the user guide for more.
Examples
A new MultiIndex is typically constructed using one of the helper methods MultiIndex.from_arrays(), MultiIndex.from_product() and MultiIndex.from_tuples(). For example (using .from_arrays):
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']] >>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color')) MultiIndex([(1, 'red'), (1, 'blue'), (2, 'red'), (2, 'blue')], names=['number', 'color'])
See further examples for how to construct a MultiIndex in the doc strings of the mentioned helper methods.
- classmethod from_arrays(arrays, sortorder=None, names=_NoDefault.no_default)[source]
Convert arrays to MultiIndex.
- Parameters:
arrays (list / sequence of array-likes) – Each array-like gives one level’s value for each data point. len(arrays) is the number of levels.
sortorder (int or None) – Level of sortedness (must be lexicographically sorted by that level).
names (list / sequence of str, optional) – Names for the levels in the index.
- Return type:
See also
MultiIndex.from_tuples – Convert list of tuples to MultiIndex.
MultiIndex.from_product – Make a MultiIndex from cartesian product of iterables.
MultiIndex.from_frame – Make a MultiIndex from a DataFrame.
Examples
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']] >>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color')) MultiIndex([(1, 'red'), (1, 'blue'), (2, 'red'), (2, 'blue')], names=['number', 'color'])
- classmethod from_tuples(tuples, sortorder=None, names=None)[source]
Convert list of tuples to MultiIndex.
- Parameters:
- Return type:
See also
MultiIndex.from_arrays – Convert list of arrays to MultiIndex.
MultiIndex.from_product – Make a MultiIndex from cartesian product of iterables.
MultiIndex.from_frame – Make a MultiIndex from a DataFrame.
Examples
>>> tuples = [(1, 'red'), (1, 'blue'), ... (2, 'red'), (2, 'blue')] >>> pd.MultiIndex.from_tuples(tuples, names=('number', 'color')) MultiIndex([(1, 'red'), (1, 'blue'), (2, 'red'), (2, 'blue')], names=['number', 'color'])
- classmethod from_product(iterables, sortorder=None, names=_NoDefault.no_default)[source]
Make a MultiIndex from the cartesian product of multiple iterables.
- Parameters:
iterables (list / sequence of iterables) – Each iterable has unique labels for each level of the index.
sortorder (int or None) – Level of sortedness (must be lexicographically sorted by that level).
names (list / sequence of str, optional) – Names for the levels in the index. If not explicitly provided, names will be inferred from the elements of iterables if an element has a name attribute.
- Return type:
See also
MultiIndex.from_arrays – Convert list of arrays to MultiIndex.
MultiIndex.from_tuples – Convert list of tuples to MultiIndex.
MultiIndex.from_frame – Make a MultiIndex from a DataFrame.
Examples
>>> numbers = [0, 1, 2] >>> colors = ['green', 'purple'] >>> pd.MultiIndex.from_product([numbers, colors], ... names=['number', 'color']) MultiIndex([(0, 'green'), (0, 'purple'), (1, 'green'), (1, 'purple'), (2, 'green'), (2, 'purple')], names=['number', 'color'])
- classmethod from_frame(df, sortorder=None, names=None)[source]
Make a MultiIndex from a DataFrame.
- Parameters:
df (DataFrame) – DataFrame to be converted to MultiIndex.
sortorder (int, optional) – Level of sortedness (must be lexicographically sorted by that level).
names (list-like, optional) – If no names are provided, use the column names, or tuple of column names if the columns is a MultiIndex. If a sequence, overwrite names with the given sequence.
- Returns:
The MultiIndex representation of the given DataFrame.
- Return type:
See also
MultiIndex.from_arrays – Convert list of arrays to MultiIndex.
MultiIndex.from_tuples – Convert list of tuples to MultiIndex.
MultiIndex.from_product – Make a MultiIndex from cartesian product of iterables.
Examples
>>> df = pd.DataFrame([['HI', 'Temp'], ['HI', 'Precip'], ... ['NJ', 'Temp'], ['NJ', 'Precip']], ... columns=['a', 'b']) >>> df a b 0 HI Temp 1 HI Precip 2 NJ Temp 3 NJ Precip
>>> pd.MultiIndex.from_frame(df) MultiIndex([('HI', 'Temp'), ('HI', 'Precip'), ('NJ', 'Temp'), ('NJ', 'Precip')], names=['a', 'b'])
Using explicit names, instead of the column names
>>> pd.MultiIndex.from_frame(df, names=['state', 'observation']) MultiIndex([('HI', 'Temp'), ('HI', 'Precip'), ('NJ', 'Temp'), ('NJ', 'Precip')], names=['state', 'observation'])
- property values: ndarray
Return an array representing the data in the Index.
Warning
We recommend using Index.array or Index.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.
- Returns:
array
- Return type:
numpy.ndarray or ExtensionArray
See also
Index.array – Reference to the underlying data.
Index.to_numpy – A NumPy array representing the underlying data.
- property array
Raises a ValueError for MultiIndex because there’s no single array backing a MultiIndex.
- Raises:
- dtypes
Return the dtypes as a Series for the underlying MultiIndex.
- levels
- set_levels(levels, *, level=None, verify_integrity=True)[source]
Set new levels on MultiIndex. Defaults to returning new index.
- Parameters:
- Return type:
Examples
>>> idx = pd.MultiIndex.from_tuples( ... [ ... (1, "one"), ... (1, "two"), ... (2, "one"), ... (2, "two"), ... (3, "one"), ... (3, "two") ... ], ... names=["foo", "bar"] ... ) >>> idx MultiIndex([(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two'), (3, 'one'), (3, 'two')], names=['foo', 'bar'])
>>> idx.set_levels([['a', 'b', 'c'], [1, 2]]) MultiIndex([('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2)], names=['foo', 'bar']) >>> idx.set_levels(['a', 'b', 'c'], level=0) MultiIndex([('a', 'one'), ('a', 'two'), ('b', 'one'), ('b', 'two'), ('c', 'one'), ('c', 'two')], names=['foo', 'bar']) >>> idx.set_levels(['a', 'b'], level='bar') MultiIndex([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], names=['foo', 'bar'])
If any of the levels passed to set_levels() exceeds the existing length, all of the values from that argument will be stored in the MultiIndex levels, though the values will be truncated in the MultiIndex output.
>>> idx.set_levels([['a', 'b', 'c'], [1, 2, 3, 4]], level=[0, 1])
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           names=['foo', 'bar'])
>>> idx.set_levels([['a', 'b', 'c'], [1, 2, 3, 4]], level=[0, 1]).levels
FrozenList([['a', 'b', 'c'], [1, 2, 3, 4]])
- property nlevels: int
Integer number of levels in this MultiIndex.
Examples
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.nlevels
3
- property levshape: Tuple[int, ...]
A tuple with the length of each level.
Examples
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.levshape
(1, 1, 1)
- property codes
- set_codes(codes, *, level=None, verify_integrity=True)[source]
Set new codes on MultiIndex. Defaults to returning new index.
- Parameters:
- Returns:
The same type as the caller or None if inplace=True.
- Return type:
new index (of same type and class…etc) or None
Examples
>>> idx = pd.MultiIndex.from_tuples(
...     [(1, "one"), (1, "two"), (2, "one"), (2, "two")], names=["foo", "bar"]
... )
>>> idx
MultiIndex([(1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([[1, 0, 1, 0], [0, 0, 1, 1]])
MultiIndex([(2, 'one'),
            (1, 'one'),
            (2, 'two'),
            (1, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([1, 0, 1, 0], level=0)
MultiIndex([(2, 'one'),
            (1, 'two'),
            (2, 'one'),
            (1, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([0, 0, 1, 1], level='bar')
MultiIndex([(1, 'one'),
            (1, 'one'),
            (2, 'two'),
            (2, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([[1, 0, 1, 0], [0, 0, 1, 1]], level=[0, 1])
MultiIndex([(2, 'one'),
            (1, 'one'),
            (2, 'two'),
            (1, 'two')],
           names=['foo', 'bar'])
- copy(names=None, deep=False, name=None)[source]
Make a copy of this object.
Names, dtype, levels and codes can be passed and will be set on new copy.
- Parameters:
names (sequence, optional) –
deep (bool, default False) –
name (Label) – Kept for compatibility with 1-dimensional Index. Should not be used.
- Return type:
Notes
In most cases, there should be no functional difference from using deep, but if deep is passed it will attempt to deepcopy. This could be potentially expensive on large MultiIndex objects.
Examples
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.copy()
MultiIndex([('a', 'b', 'c')],
           )
- dtype
- memory_usage(deep=False)[source]
Memory usage of the values.
- Parameters:
deep (bool, default False) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption.
- Return type:
bytes used
See also
numpy.ndarray.nbytes – Total bytes consumed by the elements of the array.
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False or if used on PyPy
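The reference gives no example here. As a hedged sketch (exact byte counts vary by platform and pandas version, so the checks below avoid hard-coding them):
>>> mi = pd.MultiIndex.from_product([range(3), ['a', 'b']])
>>> isinstance(mi.memory_usage(), int)   # plain count of bytes used
True
>>> mi.memory_usage(deep=True) >= mi.memory_usage()  # deep introspection can only grow the count
True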
- nbytes
Return the number of bytes in the underlying data.
- format(name=None, formatter=None, na_rep=None, names=False, space=2, sparsify=None, adjoin=True)[source]
Render a string representation of the Index.
- property names: FrozenList
Names of levels in MultiIndex.
Examples
>>> mi = pd.MultiIndex.from_arrays(
...     [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
>>> mi
MultiIndex([(1, 3, 5),
            (2, 4, 6)],
           names=['x', 'y', 'z'])
>>> mi.names
FrozenList(['x', 'y', 'z'])
- inferred_type
- is_monotonic_increasing
Return a boolean if the values are equal or increasing.
- is_monotonic_decreasing
Return a boolean if the values are equal or decreasing.
- duplicated(keep='first')[source]
Indicate duplicate index values.
Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.
- Parameters:
keep ({'first', 'last', False}, default 'first') –
The value or values in a set of duplicates to mark as duplicated.
'first' : Mark duplicates as True except for the first occurrence.
'last' : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Return type:
np.ndarray[bool]
See also
Series.duplicated – Equivalent method on pandas.Series.
DataFrame.duplicated – Equivalent method on pandas.DataFrame.
Index.drop_duplicates – Remove duplicate values from Index.
Examples
By default, for each set of duplicated values, the first occurrence is set to False and all others to True:
>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> idx.duplicated()
array([False, False,  True, False,  True])
which is equivalent to
>>> idx.duplicated(keep='first')
array([False, False,  True, False,  True])
By using 'last', the last occurrence of each set of duplicated values is set to False and all others to True:
>>> idx.duplicated(keep='last')
array([ True, False,  True, False, False])
By setting keep to False, all duplicates are True:
>>> idx.duplicated(keep=False)
array([ True, False,  True, False,  True])
- dropna(how='any')[source]
Return Index without NA/NaN values.
- Parameters:
how ({'any', 'all'}, default 'any') – If the Index is a MultiIndex, drop the value when any or all levels are NaN.
- Return type:
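A short hedged illustration of how='any' versus how='all' (not part of the original docstring; lengths are used to sidestep repr differences between versions):
>>> mi = pd.MultiIndex.from_arrays([[1, None, 2], ['a', 'b', None]])
>>> len(mi.dropna())           # 'any': drops entries with a NaN in any level
1
>>> len(mi.dropna(how='all'))  # 'all': kept, since no entry is NaN at every level
3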
- get_level_values(level)[source]
Return vector of label values for requested level.
Length of returned vector is equal to the length of the index.
- Parameters:
level (int or str) – level is either the integer position of the level in the MultiIndex, or the name of the level.
- Returns:
Values is a level of this MultiIndex converted to a single Index (or subclass thereof).
- Return type:
Notes
If the level contains missing values, the result may be cast to float with missing values specified as NaN. This is because the level is converted to a regular Index.
Examples
Create a MultiIndex:
>>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
>>> mi.names = ['level_1', 'level_2']
Get level values by supplying level as either integer or name:
>>> mi.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object', name='level_1')
>>> mi.get_level_values('level_2')
Index(['d', 'e', 'f'], dtype='object', name='level_2')
If a level contains missing values, the return type of the level may be cast to float.
>>> pd.MultiIndex.from_arrays([[1, None, 2], [3, 4, 5]]).dtypes
level_0    int64
level_1    int64
dtype: object
>>> pd.MultiIndex.from_arrays([[1, None, 2], [3, 4, 5]]).get_level_values(0)
Index([1.0, nan, 2.0], dtype='float64')
- unique(level=None)[source]
Return unique values in the index.
Unique values are returned in order of appearance; this does NOT sort.
- Parameters:
level (int or hashable, optional) – Only return values from specified level (for MultiIndex). If int, gets the level by integer position, else by level name.
- Return type:
See also
unique – NumPy array of unique values in that column.
Series.unique – Return unique values of a Series object.
- to_frame(index=True, name=_NoDefault.no_default, allow_duplicates=False)[source]
Create a DataFrame with the levels of the MultiIndex as columns.
Column ordering is determined by the DataFrame constructor with data as a dict.
- Parameters:
index (bool, default True) – Set the index of the returned DataFrame as the original MultiIndex.
name (list / sequence of str, optional) – The passed names should substitute index level names.
allow_duplicates (bool, optional, default False) –
Allow duplicate column labels to be created.
New in version 1.5.0.
- Return type:
See also
DataFrame – Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Examples
>>> mi = pd.MultiIndex.from_arrays([['a', 'b'], ['c', 'd']])
>>> mi
MultiIndex([('a', 'c'),
            ('b', 'd')],
           )
>>> df = mi.to_frame()
>>> df
     0  1
a c  a  c
b d  b  d
>>> df = mi.to_frame(index=False)
>>> df
   0  1
0  a  c
1  b  d
>>> df = mi.to_frame(name=['x', 'y'])
>>> df
     x  y
a c  a  c
b d  b  d
- to_flat_index()[source]
Convert a MultiIndex to an Index of Tuples containing the level values.
- Returns:
Index with the MultiIndex data represented in Tuples.
- Return type:
pd.Index
See also
MultiIndex.from_tuples – Convert flat index back to MultiIndex.
Notes
This method will simply return the caller if called by anything other than a MultiIndex.
Examples
>>> index = pd.MultiIndex.from_product(
...     [['foo', 'bar'], ['baz', 'qux']],
...     names=['a', 'b'])
>>> index.to_flat_index()
Index([('foo', 'baz'), ('foo', 'qux'),
       ('bar', 'baz'), ('bar', 'qux')],
      dtype='object')
- remove_unused_levels()[source]
Create new MultiIndex from current that removes unused levels.
Unused level(s) means levels that are not expressed in the labels. The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.
- Return type:
Examples
>>> mi = pd.MultiIndex.from_product([range(2), list('ab')])
>>> mi
MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b')],
           )
>>> mi[2:]
MultiIndex([(1, 'a'),
            (1, 'b')],
           )
The 0 from the first level is not represented and can be removed
>>> mi2 = mi[2:].remove_unused_levels()
>>> mi2.levels
FrozenList([[1], ['a', 'b']])
- take(indices, axis=0, allow_fill=True, fill_value=None, **kwargs)[source]
Return a new MultiIndex of the values selected by the indices.
For internal compatibility with numpy arrays.
- Parameters:
indices (array-like) – Indices to be taken.
axis (int, optional) – The axis over which to select values, always 0.
allow_fill (bool, default True) –
fill_value (scalar, default None) – If allow_fill=True and fill_value is not None, indices specified by -1 are regarded as NA. If Index doesn’t hold NA, raise ValueError.
self (MultiIndex) –
- Returns:
An index formed of elements at the given indices. Will be the same type as self, except for RangeIndex.
- Return type:
See also
numpy.ndarray.take – Return an array formed from the elements of a at the given indices.
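No example accompanies MultiIndex.take; a minimal hedged sketch:
>>> mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
>>> mi.take([0, 3])   # positional selection, not label-based
MultiIndex([('a', 1),
            ('b', 2)],
           )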
- append(other)[source]
Append a collection of Index options together.
- Parameters:
other (Index or list/tuple of indices) –
- Returns:
The combined index.
- Return type:
Examples
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b']])
>>> mi
MultiIndex([('a', 'b')],
           )
>>> mi.append(mi)
MultiIndex([('a', 'b'),
            ('a', 'b')],
           )
- argsort(*args, **kwargs)[source]
Return the integer indices that would sort the index.
- Parameters:
*args – Passed to numpy.ndarray.argsort.
**kwargs – Passed to numpy.ndarray.argsort.
- Returns:
Integer indices that would sort the index if used as an indexer.
- Return type:
np.ndarray[np.intp]
See also
numpy.argsort – Similar method for NumPy arrays.
Index.sort_values – Return sorted copy of Index.
Examples
>>> idx = pd.Index(['b', 'a', 'd', 'c'])
>>> idx
Index(['b', 'a', 'd', 'c'], dtype='object')
>>> order = idx.argsort()
>>> order
array([1, 0, 3, 2])
>>> idx[order]
Index(['a', 'b', 'c', 'd'], dtype='object')
- repeat(repeats, axis=None)[source]
Repeat elements of a MultiIndex.
Returns a new MultiIndex where each element of the current MultiIndex is repeated consecutively a given number of times.
- Parameters:
repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty MultiIndex.
axis (None) – Must be None. Has no effect but is accepted for compatibility with numpy.
- Returns:
Newly created MultiIndex with repeated elements.
- Return type:
See also
Series.repeat – Equivalent function for Series.
numpy.repeat – Similar method for numpy.ndarray.
Examples
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx
Index(['a', 'b', 'c'], dtype='object')
>>> idx.repeat(2)
Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object')
>>> idx.repeat([1, 2, 3])
Index(['a', 'b', 'b', 'c', 'c', 'c'], dtype='object')
- drop(codes, level=None, errors='raise')[source]
Make new MultiIndex with passed list of codes deleted.
- Parameters:
- Return type:
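The reference omits an example for drop; a minimal hedged sketch showing both calling conventions (whole tuples versus labels at one level):
>>> mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
>>> mi.drop([('a', 1)])      # drop complete tuples
MultiIndex([('a', 2),
            ('b', 1),
            ('b', 2)],
           )
>>> mi.drop(['a'], level=0)  # drop every entry labeled 'a' at level 0
MultiIndex([('b', 1),
            ('b', 2)],
           )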
- swaplevel(i=-2, j=-1)[source]
Swap level i with level j.
Calling this method does not change the ordering of the values.
- Parameters:
- Returns:
A new MultiIndex.
- Return type:
See also
Series.swaplevel – Swap levels i and j in a MultiIndex.
DataFrame.swaplevel – Swap levels i and j in a MultiIndex on a particular axis.
Examples
>>> mi = pd.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
...                    codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> mi
MultiIndex([('a', 'bb'),
            ('a', 'aa'),
            ('b', 'bb'),
            ('b', 'aa')],
           )
>>> mi.swaplevel(0, 1)
MultiIndex([('bb', 'a'),
            ('aa', 'a'),
            ('bb', 'b'),
            ('aa', 'b')],
           )
- reorder_levels(order)[source]
Rearrange levels using input order. May not drop or duplicate levels.
- Parameters:
order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
- Return type:
Examples
>>> mi = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=['x', 'y'])
>>> mi
MultiIndex([(1, 3),
            (2, 4)],
           names=['x', 'y'])
>>> mi.reorder_levels(order=[1, 0])
MultiIndex([(3, 1),
            (4, 2)],
           names=['y', 'x'])
>>> mi.reorder_levels(order=['y', 'x'])
MultiIndex([(3, 1),
            (4, 2)],
           names=['y', 'x'])
- sortlevel(level=0, ascending=True, sort_remaining=True)[source]
Sort MultiIndex at the requested level.
The result will respect the original ordering of the associated factor at that level.
- Parameters:
level (list-like, int or str, default 0) – If a string is given, must be a name of the level. If list-like must be names or ints of levels.
ascending (bool, default True) – False to sort in descending order. Can also be a list to specify a directed ordering.
sort_remaining (bool, default True) – Sort by the remaining levels after level.
- Returns:
sorted_index (pd.MultiIndex) – Resulting index.
indexer (np.ndarray[np.intp]) – Indices of output values in original index.
- Return type:
tuple[MultiIndex, npt.NDArray[np.intp]]
Examples
>>> mi = pd.MultiIndex.from_arrays([[0, 0], [2, 1]])
>>> mi
MultiIndex([(0, 2),
            (0, 1)],
           )
>>> mi.sortlevel()
(MultiIndex([(0, 1),
             (0, 2)],
            ), array([1, 0]))
>>> mi.sortlevel(sort_remaining=False)
(MultiIndex([(0, 2),
             (0, 1)],
            ), array([0, 1]))
>>> mi.sortlevel(1)
(MultiIndex([(0, 1),
             (0, 2)],
            ), array([1, 0]))
>>> mi.sortlevel(1, ascending=False)
(MultiIndex([(0, 2),
             (0, 1)],
            ), array([0, 1]))
- get_slice_bound(label, side)[source]
For an ordered MultiIndex, compute slice bound that corresponds to given label.
Returns the leftmost position (one past the rightmost if side='right') of the given label.
- Parameters:
- Returns:
Index of label.
- Return type:
Notes
This method only works if the level 0 index of the MultiIndex is lexsorted.
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abbc'), list('gefd')])
Get the locations from the leftmost ‘b’ in the first level until the end of the multiindex:
>>> mi.get_slice_bound('b', side="left")
1
Like above, but if you get the locations from the rightmost ‘b’ in the first level and ‘f’ in the second level:
>>> mi.get_slice_bound(('b', 'f'), side="right")
3
See also
MultiIndex.get_loc – Get location for a label or a tuple of labels.
MultiIndex.get_locs – Get location for a label/slice/list/mask or a sequence of such.
- slice_locs(start=None, end=None, step=None)[source]
For an ordered MultiIndex, compute the slice locations for input labels.
The input labels can be tuples representing partial levels, e.g. for a MultiIndex with 3 levels, you can pass a single value (corresponding to the first level), or a 1-, 2-, or 3-tuple.
- Parameters:
- Returns:
(start, end)
- Return type:
Notes
This method only works if the MultiIndex is properly lexsorted. So, if only the first 2 levels of a 3-level MultiIndex are lexsorted, you can only pass two levels to .slice_locs.
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abbd'), list('deff')],
...                                names=['A', 'B'])
Get the slice locations from the beginning of ‘b’ in the first level until the end of the multiindex:
>>> mi.slice_locs(start='b')
(1, 4)
Like above, but stop at the end of ‘b’ in the first level and ‘f’ in the second level:
>>> mi.slice_locs(start='b', end=('b', 'f'))
(1, 3)
See also
MultiIndex.get_loc – Get location for a label or a tuple of labels.
MultiIndex.get_locs – Get location for a label/slice/list/mask or a sequence of such.
- get_loc(key)[source]
Get location for a label or a tuple of labels.
The location is returned as an integer/slice or boolean mask.
- Parameters:
key (label or tuple of labels (one for each level)) –
- Returns:
If the key is past the lexsort depth, the return may be a boolean mask array, otherwise it is always a slice or int.
- Return type:
int, slice object or boolean mask
See also
Index.get_loc – The get_loc method for (single-level) Index.
MultiIndex.slice_locs – Get slice location given start label(s) and end label(s).
MultiIndex.get_locs – Get location for a label/slice/list/mask or a sequence of such.
Notes
The key cannot be a slice, list of same-level labels, a boolean mask, or a sequence of such. If you want to use those, use MultiIndex.get_locs() instead.
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')])
>>> mi.get_loc('b')
slice(1, 3, None)
>>> mi.get_loc(('b', 'e'))
1
- get_loc_level(key, level=0, drop_level=True)[source]
Get location and sliced index for requested label(s)/level(s).
- Parameters:
key (label or sequence of labels) –
level (int/level name or list thereof, optional) –
drop_level (bool, default True) – If False, the resulting index will not drop any level.
- Returns:
A 2-tuple whose elements are:
Element 0: int, slice object or boolean array.
Element 1: The resulting sliced multiindex/index. If the key contains all levels, this will be None.
- Return type:
See also
MultiIndex.get_loc – Get location for a label or a tuple of labels.
MultiIndex.get_locs – Get location for a label/slice/list/mask or a sequence of such.
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')],
...                                names=['A', 'B'])
>>> mi.get_loc_level('b')
(slice(1, 3, None), Index(['e', 'f'], dtype='object', name='B'))
>>> mi.get_loc_level('e', level='B')
(array([False,  True, False]), Index(['b'], dtype='object', name='A'))
>>> mi.get_loc_level(['b', 'e'])
(1, None)
- get_locs(seq)[source]
Get location for a sequence of labels.
- Parameters:
seq (label, slice, list, mask or a sequence of such) – You should use one of the above for each level. If a level should not be used, set it to slice(None).
- Returns:
NumPy array of integers suitable for passing to iloc.
- Return type:
numpy.ndarray
See also
MultiIndex.get_loc – Get location for a label or a tuple of labels.
MultiIndex.slice_locs – Get slice location given start label(s) and end label(s).
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')])
>>> mi.get_locs('b')
array([1, 2], dtype=int64)
>>> mi.get_locs([slice(None), ['e', 'f']])
array([1, 2], dtype=int64)
>>> mi.get_locs([[True, False, True], slice('e', 'f')])
array([2], dtype=int64)
- truncate(before=None, after=None)[source]
Slice index between two labels / tuples, return new MultiIndex.
- Parameters:
- Returns:
The truncated MultiIndex.
- Return type:
Examples
>>> mi = pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['x', 'y', 'z']])
>>> mi
MultiIndex([('a', 'x'),
            ('b', 'y'),
            ('c', 'z')],
           )
>>> mi.truncate(before='a', after='b')
MultiIndex([('a', 'x'),
            ('b', 'y')],
           )
- equals(other)[source]
Determines if two MultiIndex objects have the same labeling information (the levels themselves do not necessarily have to be the same).
- equal_levels(other)[source]
Return True if the levels of both MultiIndex objects are the same.
- Parameters:
other (MultiIndex) –
- Return type:
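A hedged sketch contrasting equals (labels) with equal_levels (levels only); not part of the original reference:
>>> a = pd.MultiIndex.from_tuples([(1, 'x'), (2, 'y')])
>>> b = pd.MultiIndex.from_tuples([(2, 'y'), (1, 'x')])
>>> a.equals(b)        # the label order differs
False
>>> a.equal_levels(b)  # but both have levels [1, 2] and ['x', 'y']
True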
- astype(dtype, copy=True)[source]
Create an Index with values cast to dtypes.
The class of a new Index is determined by dtype. When conversion is impossible, a TypeError exception is raised.
- Parameters:
dtype (numpy dtype or pandas type) – Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size.
copy (bool, default True) – By default, astype always returns a newly allocated object. If copy is set to False and internal requirements on dtype are satisfied, the original data is used to create a new Index or the original Index is returned.
- Returns:
Index with values cast to specified dtype.
- Return type:
- putmask(mask, value)[source]
Return a new MultiIndex of the values set with the mask.
- Parameters:
mask (array like) –
value (MultiIndex) – Must either be the same length as self or length one.
- Return type:
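No example accompanies putmask. A minimal hedged sketch, following numpy.putmask semantics (positions where the mask is True are taken from value):
>>> mi = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'c')])
>>> fill = pd.MultiIndex.from_tuples([(9, 'z'), (9, 'z'), (9, 'z')])
>>> mi.putmask([True, False, False], fill)
MultiIndex([(9, 'z'),
            (2, 'b'),
            (3, 'c')],
           )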
- isin(values, level=None)[source]
Return a boolean array where the index values are in values.
Compute boolean array of whether each index value is found in the passed set of values. The length of the returned boolean array matches the length of the index.
- Parameters:
- Returns:
NumPy array of boolean values.
- Return type:
np.ndarray[bool]
See also
Series.isin – Same for Series.
DataFrame.isin – Same method for DataFrames.
Notes
In the case of MultiIndex you must either specify values as a list-like object containing tuples that are the same length as the number of levels, or specify level. Otherwise it will raise a ValueError.
If level is specified:
if it is the name of one and only one index level, use that level;
otherwise it should be a number indicating level position.
Examples
>>> idx = pd.Index([1, 2, 3])
>>> idx
Index([1, 2, 3], dtype='int64')
Check whether each index value is in a list of values.
>>> idx.isin([1, 4])
array([ True, False, False])
>>> midx = pd.MultiIndex.from_arrays([[1, 2, 3],
...                                   ['red', 'blue', 'green']],
...                                  names=('number', 'color'))
>>> midx
MultiIndex([(1,   'red'),
            (2,  'blue'),
            (3, 'green')],
           names=['number', 'color'])
Check whether the strings in the ‘color’ level of the MultiIndex are in a list of colors.
>>> midx.isin(['red', 'orange', 'yellow'], level='color')
array([ True, False, False])
To check across the levels of a MultiIndex, pass a list of tuples:
>>> midx.isin([(1, 'red'), (3, 'red')])
array([ True, False, False])
For a DatetimeIndex, string values in values are converted to Timestamps.
>>> dates = ['2000-03-11', '2000-03-12', '2000-03-13']
>>> dti = pd.to_datetime(dates)
>>> dti
DatetimeIndex(['2000-03-11', '2000-03-12', '2000-03-13'],
              dtype='datetime64[ns]', freq=None)
>>> dti.isin(['2000-03-11'])
array([ True, False, False])
- rename(names, *, level=None, inplace=False)
Set Index or MultiIndex name.
Able to set new names partially and by level.
- Parameters:
names (label or list of label or dict-like for MultiIndex) –
Name(s) to set.
Changed in version 1.3.0.
level (int, label or list of int or label, optional) –
If the index is a MultiIndex and names is not dict-like, level(s) to set (None for all levels). Otherwise level must be None.
Changed in version 1.3.0.
inplace (bool, default False) – Modifies the object directly, instead of creating a new Index or MultiIndex.
self (_IndexT) –
- Returns:
The same type as the caller or None if inplace=True.
- Return type:
Index or None
See also
Index.rename – Able to set new names without level.
Examples
>>> idx = pd.Index([1, 2, 3, 4])
>>> idx
Index([1, 2, 3, 4], dtype='int64')
>>> idx.set_names('quarter')
Index([1, 2, 3, 4], dtype='int64', name='quarter')
>>> idx = pd.MultiIndex.from_product([['python', 'cobra'],
...                                   [2018, 2019]])
>>> idx
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           )
>>> idx = idx.set_names(['kind', 'year'])
>>> idx.set_names('species', level=0)
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['species', 'year'])
When renaming levels with a dict, level cannot be passed.
>>> idx.set_names({'kind': 'snake'})
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['snake', 'year'])
- class pandas.NamedAgg[source]
Helper for column-specific aggregation with control over output column names.
Subclass of typing.NamedTuple.
- Parameters:
column (Hashable) – Column label in the DataFrame to apply aggfunc.
aggfunc (function or str) – Function to apply to the provided column. If string, the name of a built-in pandas function.
Examples
>>> df = pd.DataFrame({"key": [1, 1, 2], "a": [-1, 0, 1], 1: [10, 11, 12]}) >>> agg_a = pd.NamedAgg(column="a", aggfunc="min") >>> agg_1 = pd.NamedAgg(column=1, aggfunc=np.mean) >>> df.groupby("key").agg(result_a=agg_a, result_1=agg_1) result_a result_1 key 1 -1 10.5 2 1 12.0
- class pandas.Period
Represents a period of time.
- Parameters:
value (Period or str, default None) – The time period represented (e.g., '4Q2005'). This represents neither the start nor the end of the period, but rather the entire period itself.
freq (str, default None) – One of pandas period strings or corresponding objects. Accepted strings are listed in the offset alias section in the user docs.
ordinal (int, default None) – The period offset from the proleptic Gregorian epoch.
year (int, default None) – Year value of the period.
month (int, default 1) – Month value of the period.
quarter (int, default None) – Quarter value of the period.
day (int, default 1) – Day value of the period.
hour (int, default 0) – Hour value of the period.
minute (int, default 0) – Minute value of the period.
second (int, default 0) – Second value of the period.
Examples
>>> period = pd.Period('2012-1-1', freq='D')
>>> period
Period('2012-01-01', 'D')
- class pandas.PeriodDtype[source]
An ExtensionDtype for Period data.
This is not an actual numpy dtype, but a duck type.
- Parameters:
freq (str or DateOffset) – The frequency of this PeriodDtype.
- freq
- None()
Examples
>>> pd.PeriodDtype(freq='D')
period[D]
>>> pd.PeriodDtype(freq=pd.offsets.MonthEnd())
period[M]
- num = 102
- property freq
The frequency object of this PeriodDtype.
- classmethod construct_from_string(string)[source]
Strict construction from a string; raise a TypeError if not possible.
- Parameters:
string (str) –
- Return type:
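A one-line hedged sketch (the repr shown is the dtype's usual string form):
>>> pd.PeriodDtype.construct_from_string('period[D]')
period[D]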
- property name: str
A string identifying the data type.
Will be used for display in, e.g., Series.dtype.
- property na_value: NaTType
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- class pandas.PeriodIndex[source]
Immutable ndarray holding ordinal values indicating regular periods in time.
Index keys are boxed to Period objects, which carry the metadata (e.g., frequency information).
- Parameters:
data (array-like (1d int np.ndarray or PeriodArray), optional) – Optional period-like data to construct index with.
copy (bool) – Make a copy of input ndarray.
freq (str or period object, optional) – One of pandas period strings or corresponding objects.
dtype (str or PeriodDtype, default None) –
name (Hashable) –
- Return type:
- day
- dayofweek
- day_of_week
- dayofyear
- day_of_year
- days_in_month
- daysinmonth
- end_time
- freq
- Type:
BaseOffset
- freqstr
- hour
- is_leap_year
- minute
- month
- quarter
- qyear
- second
- start_time
- week
- weekday
- weekofyear
- year
- strftime()
See also
Index – The base pandas Index type.
Period – Represents a period of time.
DatetimeIndex – Index with datetime64 data.
TimedeltaIndex – Index of timedelta64 data.
period_range – Create a fixed-frequency PeriodIndex.
Examples
>>> idx = pd.PeriodIndex(year=[2000, 2002], quarter=[1, 3])
>>> idx
PeriodIndex(['2000Q1', '2002Q3'], dtype='period[Q-DEC]')
- asfreq(freq=None, how='E')[source]
Convert the PeriodArray to the specified frequency freq.
Equivalent to applying pandas.Period.asfreq() with the given arguments to each Period in this PeriodArray.
- Parameters:
freq (str) – A frequency.
how (str {'E', 'S'}, default 'E') –
Whether the elements should be aligned to the end or start within the period.
’E’, ‘END’, or ‘FINISH’ for end,
’S’, ‘START’, or ‘BEGIN’ for start.
January 31st (‘END’) vs. January 1st (‘START’) for example.
- Returns:
The transformed PeriodArray with the new frequency.
- Return type:
PeriodArray
See also
pandas.arrays.PeriodArray.asfreq – Convert each Period in a PeriodArray to the given frequency.
Period.asfreq – Convert a Period object to the given frequency.
Examples
>>> pidx = pd.period_range('2010-01-01', '2015-01-01', freq='A')
>>> pidx
PeriodIndex(['2010', '2011', '2012', '2013', '2014', '2015'],
            dtype='period[A-DEC]')
>>> pidx.asfreq('M')
PeriodIndex(['2010-12', '2011-12', '2012-12', '2013-12', '2014-12',
             '2015-12'], dtype='period[M]')
>>> pidx.asfreq('M', how='S')
PeriodIndex(['2010-01', '2011-01', '2012-01', '2013-01', '2014-01',
             '2015-01'], dtype='period[M]')
- to_timestamp(freq=None, how='start')[source]
Cast to DatetimeArray/Index.
- Parameters:
freq (str or DateOffset, optional) – Target frequency. The default is ‘D’ for week or longer, ‘S’ otherwise.
how ({'s', 'e', 'start', 'end'}) – Whether to use the start or end of the time period being converted.
- Return type:
DatetimeArray/Index
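No example is given for to_timestamp; a minimal hedged sketch (exact repr may vary by pandas version, and how='end' would instead anchor each timestamp at the final instant of the period):
>>> pidx = pd.period_range('2023-01', periods=3, freq='M')
>>> pidx.to_timestamp()   # default: start of each period
DatetimeIndex(['2023-01-01', '2023-02-01', '2023-03-01'],
              dtype='datetime64[ns]', freq='MS')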
- property hour
The hour of the period.
- property minute
The minute of the period.
- property second
The second of the period.
- property values: ndarray
Return an array representing the data in the Index.
Warning
We recommend using Index.array or Index.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.
- Returns:
array
- Return type:
numpy.ndarray or ExtensionArray
See also
Index.array – Reference to the underlying data.
Index.to_numpy – A NumPy array representing the underlying data.
- asof_locs(where, mask)[source]
Return the locations of the requested labels.
- Parameters:
where (Index) – An array of timestamps.
mask (npt.NDArray[np.bool_]) – Array of booleans where data is not NA.
- Return type:
np.ndarray
- property is_full: bool
Returns True if this PeriodIndex is range-like in that all Periods between start and end are present, in order.
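A hedged sketch (assuming annual frequency; not from the original reference):
>>> pd.PeriodIndex(['2020', '2021', '2022'], freq='A').is_full
True
>>> pd.PeriodIndex(['2020', '2022'], freq='A').is_full   # 2021 is missing
False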
- shift(periods=1, freq=None)[source]
Shift index by desired number of time frequency increments.
This method is for shifting the values of datetime-like indexes by a specified time increment a given number of times.
- Parameters:
periods (int, default 1) – Number of periods (or increments) to shift by, can be positive or negative.
freq (pandas.DateOffset, pandas.Timedelta or string, optional) – Frequency increment to shift by. If None, the index is shifted by its own freq attribute. Offset aliases are valid strings, e.g., ‘D’, ‘W’, ‘M’ etc.
- Returns:
Shifted index.
- Return type:
See also
Index.shift – Shift values of Index.
PeriodIndex.shift – Shift values of PeriodIndex.
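No example is shown for PeriodIndex.shift; a minimal hedged sketch:
>>> pidx = pd.period_range('2023-01', periods=3, freq='M')
>>> pidx.shift(1)    # shift forward by one of the index's own periods
PeriodIndex(['2023-02', '2023-03', '2023-04'], dtype='period[M]')
>>> pidx.shift(-1)   # negative values shift backwards
PeriodIndex(['2022-12', '2023-01', '2023-02'], dtype='period[M]')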
- property day
The days of the period.
- property day_of_week
The day of the week with Monday=0, Sunday=6.
- property day_of_year
The ordinal day of the year.
- property dayofweek
The day of the week with Monday=0, Sunday=6.
- property dayofyear
The ordinal day of the year.
- property days_in_month
The number of days in the month.
- property daysinmonth
The number of days in the month.
- property end_time
Get the Timestamp for the end of the period.
- Return type:
See also
Period.start_time – Return the start Timestamp.
Period.dayofyear – Return the day of year.
Period.daysinmonth – Return the days in that month.
Period.dayofweek – Return the day of the week.
- property is_leap_year
Logical indicating if the date belongs to a leap year.
- property month
The month as January=1, December=12.
- property quarter
The quarter of the date.
- property qyear
- property start_time
Get the Timestamp for the start of the period.
- Return type:
See also
Period.end_time – Return the end Timestamp.
Period.dayofyear – Return the day of year.
Period.daysinmonth – Return the days in that month.
Period.dayofweek – Return the day of the week.
Examples
>>> period = pd.Period('2012-1-1', freq='D')
>>> period
Period('2012-01-01', 'D')
>>> period.start_time
Timestamp('2012-01-01 00:00:00')
>>> period.end_time
Timestamp('2012-01-01 23:59:59.999999999')
- strftime(*args, **kwargs)
Convert to Index using specified date_format.
Return an Index of formatted strings specified by date_format, which supports the same string format as the python standard library. Details of the string format can be found in python string format doc.
Formats supported by the C strftime API but not by the python string format doc (such as “%R”, “%r”) are not officially supported and should be preferably replaced with their supported equivalents (such as “%H:%M”, “%I:%M:%S %p”).
Note that PeriodIndex supports additional directives, detailed in Period.strftime.
- Parameters:
date_format (str) – Date format string (e.g. “%Y-%m-%d”).
- Returns:
NumPy ndarray of formatted strings.
- Return type:
ndarray[object]
See also
to_datetime – Convert the given argument to datetime.
DatetimeIndex.normalize – Return DatetimeIndex with times to midnight.
DatetimeIndex.round – Round the DatetimeIndex to the specified freq.
DatetimeIndex.floor – Floor the DatetimeIndex to the specified freq.
Timestamp.strftime – Format a single Timestamp.
Period.strftime – Format a single Period.
Examples
>>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"), ... periods=3, freq='s') >>> rng.strftime('%B %d, %Y, %r') Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM', 'March 10, 2018, 09:00:02 AM'], dtype='object')
- property week
The week ordinal of the year.
- property weekday
The day of the week with Monday=0, Sunday=6.
- property weekofyear
The week ordinal of the year.
- property year
The year of the period.
- class pandas.RangeIndex[source]
Immutable Index implementing a monotonic integer range.
RangeIndex is a memory-saving special case of an Index limited to representing monotonic ranges with a 64-bit dtype. Using RangeIndex may in some instances improve computing speed.
This is the default index type used by DataFrame and Series when no explicit index is provided by the user.
- Parameters:
start (int (default: 0), range, or other RangeIndex instance) – If int and “stop” is not given, interpreted as “stop” instead.
stop (int (default: 0)) –
step (int (default: 1)) –
dtype (np.int64) – Unused, accepted for homogeneity with other index types.
copy (bool, default False) – Unused, accepted for homogeneity with other index types.
name (object, optional) – Name to be stored in the index.
- Return type:
- start
- stop
- step
See also
Index – The base pandas Index type.
- classmethod from_range(data, name=None, dtype=None)[source]
Create RangeIndex from a range object.
- Return type:
- Parameters:
data (range) –
dtype (Dtype | None) –
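A one-line hedged sketch:
>>> pd.RangeIndex.from_range(range(0, 10, 2))
RangeIndex(start=0, stop=10, step=2)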
- nbytes
Return the number of bytes in the underlying data.
- memory_usage(deep=False)[source]
Memory usage of the values.
- Parameters:
deep (bool) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption
- Return type:
bytes used
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False
See also
numpy.ndarray.nbytes
- property dtype: dtype
Return the dtype object of the underlying data.
- is_monotonic_increasing
- is_monotonic_decreasing
- get_loc(key)[source]
Get integer location, slice or boolean mask for requested label.
- Parameters:
key (label) –
- Return type:
int if unique index, slice if monotonic index, else mask
Examples
>>> unique_index = pd.Index(list('abc'))
>>> unique_index.get_loc('b')
1
>>> monotonic_index = pd.Index(list('abbc'))
>>> monotonic_index.get_loc('b')
slice(1, 3, None)
>>> non_monotonic_index = pd.Index(list('abcb'))
>>> non_monotonic_index.get_loc('b')
array([False,  True, False,  True])
- tolist()[source]
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period)
- Return type:
See also
numpy.ndarray.tolist – Return the array as an a.ndim-levels deep nested list of Python scalars.
- copy(name=None, deep=False)[source]
Make a copy of this object.
Name is set on the new object.
- Parameters:
name (Label, optional) – Set name for new object.
deep (bool, default False) –
- Returns:
Index refer to new object which is a copy of this object.
- Return type:
Notes
In most cases, there should be no functional difference from using deep, but if deep is passed it will attempt to deepcopy.
- argsort(*args, **kwargs)[source]
Returns the indices that would sort the index and its underlying data.
- Return type:
np.ndarray[np.intp]
See also
numpy.ndarray.argsort
- factorize(sort=False, use_na_sentinel=True)[source]
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method Series.factorize() and Index.factorize().
- Parameters:
sort (bool, default False) – Sort uniques and shuffle codes to maintain the relationship.
use_na_sentinel (bool, default True) –
If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and will not drop the NaN from the uniques of the values.
New in version 1.5.0.
- Returns:
codes (ndarray) – An integer ndarray that's an indexer into uniques. uniques.take(codes) will have the same values as values.
uniques (ndarray, Index, or Categorical) – The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
Note
Even if there’s a missing value in values, uniques will not contain an entry for it.
- Return type:
tuple[npt.NDArray[np.intp], RangeIndex]
Notes
Reference the user guide for more examples.
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1 and missing values are not included in uniques.
>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting use_na_sentinel=False.
>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values)  # default: use_na_sentinel=True
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array([1., 2.])
>>> codes, uniques = pd.factorize(values, use_na_sentinel=False)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1.,  2., nan])
- sort_values(return_indexer=False, ascending=True, na_position='last', key=None)[source]
Return a sorted copy of the index.
Return a sorted copy of the index, and optionally return the indices that sorted the index itself.
- Parameters:
return_indexer (bool, default False) – Should the indices that would sort the index be returned.
ascending (bool, default True) – Should the index values be sorted in an ascending order.
na_position ({'first' or 'last'}, default 'last') –
Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
New in version 1.2.0.
key (callable, optional) –
If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape.
New in version 1.1.0.
- Returns:
sorted_index (pandas.Index) – Sorted copy of the index.
indexer (numpy.ndarray, optional) – The indices that the index itself was sorted by.
See also
Series.sort_values – Sort values of a Series.
DataFrame.sort_values – Sort values in a DataFrame.
Examples
>>> idx = pd.Index([10, 100, 1, 1000])
>>> idx
Index([10, 100, 1, 1000], dtype='int64')
Sort values in ascending order (default behavior).
>>> idx.sort_values()
Index([1, 10, 100, 1000], dtype='int64')
Sort values in descending order, and also get the indices idx was sorted by.
>>> idx.sort_values(ascending=False, return_indexer=True)
(Index([1000, 100, 10, 1], dtype='int64'), array([3, 1, 0, 2]))
- symmetric_difference(other, result_name=None, sort=None)[source]
Compute the symmetric difference of two Index objects.
- Parameters:
other (Index or array-like) –
result_name (str) –
sort (bool or None, default None) –
Whether to sort the resulting index. By default, the values are attempted to be sorted, but any TypeError from incomparable elements is caught by pandas.
None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.
False : Do not sort the result.
True : Sort the result (which may raise TypeError).
- Return type:
Notes
symmetric_difference contains elements that appear in either idx1 or idx2 but not both. Equivalent to the Index created by idx1.difference(idx2) | idx2.difference(idx1) with duplicates dropped.
Examples
>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([2, 3, 4, 5])
>>> idx1.symmetric_difference(idx2)
Index([1, 5], dtype='int64')
- delete(loc)[source]
Make new Index with passed location(-s) deleted.
- Parameters:
loc (int or list of int) – Location of item(-s) which will be deleted. Use a list of locations to delete more than one value at the same time.
- Returns:
Will be same type as self, except for RangeIndex.
- Return type:
See also
numpy.delete – Delete rows and columns from a NumPy array (ndarray).
Examples
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.delete(1)
Index(['a', 'c'], dtype='object')
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.delete([0, 2])
Index(['b'], dtype='object')
- insert(loc, item)[source]
Make new Index inserting new item at location.
Follows Python numpy.insert semantics for negative values.
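No example is given. A hedged sketch: in recent pandas an insertion that extends the arithmetic sequence keeps a RangeIndex, while any other insertion falls back to a plain integer Index.
>>> idx = pd.RangeIndex(3)
>>> idx.insert(3, 3)   # still an arithmetic sequence
RangeIndex(start=0, stop=4, step=1)
>>> idx.insert(1, 10)  # breaks the sequence
Index([0, 10, 1, 2], dtype='int64')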
- all(*args, **kwargs)[source]
Return whether all elements are Truthy.
- Parameters:
*args – Required for compatibility with numpy.
**kwargs – Required for compatibility with numpy.
- Returns:
A single element array-like may be converted to bool.
- Return type:
bool or array-like (if axis is specified)
See also
Index.any – Return whether any element in an Index is True.
Series.any – Return whether any element in a Series is True.
Series.all – Return whether all elements in a Series are True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
True, because nonzero integers are considered True.
>>> pd.Index([1, 2, 3]).all()
True
False, because 0 is considered False.
>>> pd.Index([0, 1, 2]).all()
False
- any(*args, **kwargs)[source]
Return whether any element is Truthy.
- Parameters:
*args – Required for compatibility with numpy.
**kwargs – Required for compatibility with numpy.
- Returns:
A single element array-like may be converted to bool.
- Return type:
bool or array-like (if axis is specified)
See also
Index.all – Return whether all elements are True.
Series.all – Return whether all elements are True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
>>> index = pd.Index([0, 1, 2])
>>> index.any()
True
>>> index = pd.Index([0, 0, 0])
>>> index.any()
False
- class pandas.Series[source]
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.
- Parameters:
data (array-like, Iterable, dict, or scalar value) – Contains data stored in Series. If data is a dict, argument order is maintained.
index (array-like or Index (1d)) – Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
dtype (str, numpy.dtype, or ExtensionDtype, optional) – Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
name (Hashable, default None) – The name to give to the Series.
copy (bool, default False) – Copy input data. Only affects Series or 1d ndarray input. See examples.
fastpath (bool) –
Notes
Please reference the User Guide for more information.
Examples
Constructing Series from a dictionary with an Index specified
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])
>>> ser
a    1
b    2
c    3
dtype: int64
The keys of the dictionary match with the Index values, hence the Index values have no effect.
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64
Note that the Index is first built with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result.
Constructing Series from a list with copy=False.
>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64
Due to the input data type, the Series has a copy of the original data even though copy=False, so the data is unchanged.
Constructing Series from a 1d ndarray with copy=False.
>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999,   2])
>>> ser
0    999
1      2
dtype: int64
Due to the input data type, the Series has a view on the original data, so the data is changed as well.
- property hasnans: bool
Return True if there are any NaNs.
Enables various performance speedups.
- Return type:
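A minimal hedged sketch:
>>> pd.Series([1, 2, np.nan]).hasnans
True
>>> pd.Series([1, 2, 3]).hasnans
False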
- div(other, level=None, fill_value=None, axis=0)
Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rtruediv – Reverse of the Floating division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
- rdiv(other, level=None, fill_value=None, axis=0)
Return Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.truediv – Element-wise Floating division, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
- property dtype: DtypeObj
Return the dtype object of the underlying data.
Examples
>>> s = pd.Series([1, 2, 3])
>>> s.dtype
dtype('int64')
- property dtypes: DtypeObj
Return the dtype object of the underlying data.
Examples
>>> s = pd.Series([1, 2, 3])
>>> s.dtypes
dtype('int64')
- property name: Hashable
Return the name of the Series.
The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used whenever displaying the Series using the interpreter.
- Returns:
The name of the Series, also the column name if part of a DataFrame.
- Return type:
label (hashable object)
See also
Series.rename – Sets the Series name when given a scalar input.
Index.name – Corresponding Index property.
Examples
The Series name can be set initially when calling the constructor.
>>> s = pd.Series([1, 2, 3], dtype=np.int64, name='Numbers')
>>> s
0    1
1    2
2    3
Name: Numbers, dtype: int64
>>> s.name = "Integers"
>>> s
0    1
1    2
2    3
Name: Integers, dtype: int64
The name of a Series within a DataFrame is its column name.
>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
...                   columns=["Odd Numbers", "Even Numbers"])
>>> df
   Odd Numbers  Even Numbers
0            1             2
1            3             4
2            5             6
>>> df["Even Numbers"].name
'Even Numbers'
- property values
Return Series as ndarray or ndarray-like depending on the dtype.
Warning
We recommend using Series.array or Series.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.
- Return type:
numpy.ndarray or ndarray-like
See also
Series.array – Reference to the underlying data.
Series.to_numpy – A NumPy array representing the underlying data.
Examples
>>> pd.Series([1, 2, 3]).values
array([1, 2, 3])
>>> pd.Series(list('aabc')).values
array(['a', 'a', 'b', 'c'], dtype=object)
>>> pd.Series(list('aabc')).astype('category').values
['a', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
Timezone aware datetime data is converted to UTC:
>>> pd.Series(pd.date_range('20130101', periods=3,
...                         tz='US/Eastern')).values
array(['2013-01-01T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
- property array: ExtensionArray
The ExtensionArray of the data backing this Series or Index.
- Returns:
An ExtensionArray of the values stored within. For extension types, this is the actual array. For NumPy native types, this is a thin (no copy) wrapper around numpy.ndarray. .array differs from .values, which may require converting the data to a different form.
- Return type:
ExtensionArray
See also
Index.to_numpy – Similar method that always returns a NumPy array.
Series.to_numpy – Similar method that always returns a NumPy array.
Notes
This table lays out the different array types for each extension dtype within pandas.
dtype               array type
category            Categorical
period              PeriodArray
interval            IntervalArray
IntegerNA           IntegerArray
string              StringArray
boolean             BooleanArray
datetime64[ns, tz]  DatetimeArray
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes .array will be an arrays.NumpyExtensionArray wrapping the actual ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then use Series.to_numpy() instead.
Examples
For regular NumPy types like int and float, a PandasArray is returned.
>>> pd.Series([1, 2, 3]).array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64
For extension types, like Categorical, the actual ExtensionArray is returned
>>> ser = pd.Series(pd.Categorical(['a', 'b', 'a']))
>>> ser.array
['a', 'b', 'a']
Categories (2, object): ['a', 'b']
- ravel(order='C')[source]
Return the flattened underlying data as an ndarray or ExtensionArray.
- Returns:
Flattened data of the Series.
- Return type:
numpy.ndarray or ExtensionArray
- Parameters:
order (str) –
See also
numpy.ndarray.ravel – Return a flattened array.
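A minimal hedged sketch:
>>> pd.Series([1, 2, 3]).ravel()
array([1, 2, 3])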
- view(dtype=None)[source]
Create a new view of the Series.
This function will return a new Series with a view of the same underlying values in memory, optionally reinterpreted with a new data type. The new data type must preserve the same size in bytes so as not to cause index misalignment.
- Parameters:
dtype (data type) – Data type object or one of their string representations.
- Returns:
A new Series object as a view of the same data in memory.
- Return type:
See also
numpy.ndarray.view – Equivalent numpy function to create a new view of the same data in memory.
Notes
Series are instantiated with dtype=float64 by default. While numpy.ndarray.view() will return a view with the same data type as the original array, Series.view() (without specified dtype) will try using float64 and may fail if the original data type size in bytes is not the same.
Examples
>>> s = pd.Series([-2, -1, 0, 1, 2], dtype='int8')
>>> s
0   -2
1   -1
2    0
3    1
4    2
dtype: int8
The 8 bit signed integer representation of -1 is 0b11111111, but the same bytes represent 255 if read as an 8 bit unsigned integer:
>>> us = s.view('uint8')
>>> us
0    254
1    255
2      0
3      1
4      2
dtype: uint8
The views share the same underlying values:
>>> us[0] = 128
>>> s
0   -128
1     -1
2      0
3      1
4      2
dtype: int8
- property axes: list[pandas.core.indexes.base.Index]
Return a list of the row axis labels.
- take(indices, axis=0, **kwargs)[source]
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in the index attribute of the object. We are indexing according to the actual position of the element in the object.
- Parameters:
indices (array-like) – An array of ints indicating which positions to take.
axis ({0 or 'index', 1 or 'columns', None}, default 0) – The axis on which to select elements. 0 means that we are selecting rows, 1 means that we are selecting columns. For Series this parameter is unused and defaults to 0.
**kwargs – For compatibility with numpy.take(). Has no effect on the output.
- Returns:
An array-like containing the elements taken from the object.
- Return type:
same type as caller
See also
DataFrame.loc – Select a subset of a DataFrame by labels.
DataFrame.iloc – Select a subset of a DataFrame by positions.
numpy.take – Take elements from an array along an axis.
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
Take elements at positions 0 and 3 along the axis 0 (default).
Note how the actual indices selected (0 and 1) do not correspond to our selected indices 0 and 3. That’s because we are selecting the 0th and 3rd rows, not rows whose indices equal 0 and 3.
>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN
Take elements at indices 1 and 2 along the axis 1 (column selection).
>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN
We may take elements using negative integers for positive indices, starting from the end of the object, just like with Python lists.
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
- repeat(repeats, axis=None)[source]
Repeat elements of a Series.
Returns a new Series where each element of the current Series is repeated consecutively a given number of times.
- Parameters:
repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty Series.
axis (None) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
Newly created Series with repeated elements.
- Return type:
Series
See also
Index.repeatEquivalent function for Index.
numpy.repeatSimilar method for
numpy.ndarray.
Examples
>>> s = pd.Series(['a', 'b', 'c']) >>> s 0 a 1 b 2 c dtype: object >>> s.repeat(2) 0 a 0 a 1 b 1 b 2 c 2 c dtype: object >>> s.repeat([1, 2, 3]) 0 a 1 b 1 b 2 c 2 c 2 c dtype: object
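As noted under the repeats parameter, repeating 0 times yields an empty Series; a quick sketch continuing the example above:
>>> s.repeat(0)
Series([], dtype: object)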
- reset_index(level: Hashable | Sequence[Hashable] = None, *, drop: Literal[False] = False, name: Hashable = _NoDefault.no_default, inplace: Literal[False] = False, allow_duplicates: bool = False) DataFrame[source]
- reset_index(level: Hashable | Sequence[Hashable] = None, *, drop: Literal[True], name: Hashable = _NoDefault.no_default, inplace: Literal[False] = False, allow_duplicates: bool = False) Series
- reset_index(level: Hashable | Sequence[Hashable] = None, *, drop: bool = False, name: Hashable = _NoDefault.no_default, inplace: Literal[True], allow_duplicates: bool = False) None
Generate a new DataFrame or Series with the index reset.
This is useful when the index needs to be treated as a column, or when the index is meaningless and needs to be reset to the default before another operation.
- Parameters:
level (int, str, tuple, or list, default optional) – For a Series with a MultiIndex, only remove the specified levels from the index. Removes all levels by default.
drop (bool, default False) – Just reset the index, without inserting it as a column in the new DataFrame.
name (object, optional) – The name to use for the column containing the original Series values. Uses self.name by default. This argument is ignored when drop is True.
inplace (bool, default False) – Modify the Series in place (do not create a new object).
allow_duplicates (bool, default False) –
Allow duplicate column labels to be created.
New in version 1.5.0.
- Returns:
When drop is False (the default), a DataFrame is returned. The newly created columns will come first in the DataFrame, followed by the original Series values. When drop is True, a Series is returned. In either case, if inplace=True, no value is returned.
- Return type:
Series or DataFrame or None
See also
DataFrame.reset_indexAnalogous function for DataFrame.
Examples
>>> s = pd.Series([1, 2, 3, 4], name='foo', ... index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
Generate a DataFrame with default index.
>>> s.reset_index() idx foo 0 a 1 1 b 2 2 c 3 3 d 4
To specify the name of the new column use name.
>>> s.reset_index(name='values') idx values 0 a 1 1 b 2 2 c 3 3 d 4
To generate a new Series with the default index, set drop to True.
>>> s.reset_index(drop=True) 0 1 1 2 2 3 3 4 Name: foo, dtype: int64
The level parameter is useful for a Series with a multi-level index.
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']), ... np.array(['one', 'two', 'one', 'two'])] >>> s2 = pd.Series( ... range(4), name='foo', ... index=pd.MultiIndex.from_arrays(arrays, ... names=['a', 'b']))
To remove a specific level from the Index, use level.
>>> s2.reset_index(level='a') a foo b one bar 0 two bar 1 one baz 2 two baz 3
If level is not set, all levels are removed from the Index.
>>> s2.reset_index() a b foo 0 bar one 0 1 bar two 1 2 baz one 2 3 baz two 3
- to_string(buf: None = None, na_rep: str = 'NaN', float_format: str | None = None, header: bool = True, index: bool = True, length=False, dtype=False, name=False, max_rows: int | None = None, min_rows: int | None = None) str[source]
- to_string(buf: FilePath | WriteBuffer[str], na_rep: str = 'NaN', float_format: str | None = None, header: bool = True, index: bool = True, length=False, dtype=False, name=False, max_rows: int | None = None, min_rows: int | None = None) None
Render a string representation of the Series.
- Parameters:
buf (StringIO-like, optional) – Buffer to write to.
na_rep (str, optional) – String representation of NaN to use, default ‘NaN’.
float_format (one-parameter function, optional) – Formatter function to apply to columns’ elements if they are floats, default None.
header (bool, default True) – Add the Series header (index name).
index (bool, optional) – Add index (row) labels, default True.
length (bool, default False) – Add the Series length.
dtype (bool, default False) – Add the Series dtype.
name (bool, default False) – Add the Series name if not None.
max_rows (int, optional) – Maximum number of rows to show before truncating. If None, show all.
min_rows (int, optional) – The number of rows to display in a truncated repr (when number of rows is above max_rows).
- Returns:
String representation of Series if buf=None, otherwise None.
- Return type:
str or None
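Examples
No example survives above; a minimal sketch of the default rendering (exact column spacing may vary across pandas versions):
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> print(s.to_string())
0    1
1    2
2    3
>>> print(s.to_string(index=False))
1
2
3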
- to_markdown(buf=None, mode='wt', index=True, storage_options=None, **kwargs)[source]
Print Series in Markdown-friendly format.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) –
Add index (row) labels.
New in version 1.1.0.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
**kwargs –
These parameters will be passed to tabulate.
- Returns:
Series in Markdown-friendly format.
- Return type:
str
Notes
Requires the tabulate package.
Examples
>>> s = pd.Series(["elk", "pig", "dog", "quetzal"], name="animal") >>> print(s.to_markdown()) | | animal | |---:|:---------| | 0 | elk | | 1 | pig | | 2 | dog | | 3 | quetzal |
Output markdown with a tabulate option.
>>> print(s.to_markdown(tablefmt="grid")) +----+----------+ | | animal | +====+==========+ | 0 | elk | +----+----------+ | 1 | pig | +----+----------+ | 2 | dog | +----+----------+ | 3 | quetzal | +----+----------+
- items()[source]
Lazily iterate over (index, value) tuples.
This method returns an iterable tuple (index, value). This is convenient if you want to create a lazy iterator.
- Returns:
Iterable of tuples containing the (index, value) pairs from a Series.
- Return type:
iterable
See also
DataFrame.itemsIterate over (column name, Series) pairs.
DataFrame.iterrowsIterate over DataFrame rows as (index, Series) pairs.
Examples
>>> s = pd.Series(['A', 'B', 'C']) >>> for index, value in s.items(): ... print(f"Index : {index}, Value : {value}") Index : 0, Value : A Index : 1, Value : B Index : 2, Value : C
- to_dict(into=<class 'dict'>)[source]
Convert Series to {label -> value} dict or dict-like object.
- Parameters:
into (class, default dict) – The collections.abc.Mapping subclass to use as the return object. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
- Returns:
Key-value representation of Series.
- Return type:
collections.abc.Mapping
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s.to_dict() {0: 1, 1: 2, 2: 3, 3: 4} >>> from collections import OrderedDict, defaultdict >>> s.to_dict(OrderedDict) OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) >>> dd = defaultdict(list) >>> s.to_dict(dd) defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
- to_frame(name=_NoDefault.no_default)[source]
Convert Series to DataFrame.
- Parameters:
name (object, optional) – The passed name should substitute for the series name (if it has one).
- Returns:
DataFrame representation of Series.
- Return type:
DataFrame
Examples
>>> s = pd.Series(["a", "b", "c"], ... name="vals") >>> s.to_frame() vals 0 a 1 b 2 c
- groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)[source]
Group Series using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
- Parameters:
by (mapping, function, label, pd.Grouper or list of such) –
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see the .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.
level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.
as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort (bool, default True) –
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no longer sort the values.
group_keys (bool, default True) –
When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.
Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed Series or DataFrame. Specify group_keys explicitly to include the group keys or not.
Changed in version 2.0.0: group_keys now defaults to True.
observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
dropna (bool, default True) –
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
New in version 1.1.0.
- Returns:
Returns a groupby object that contains information about the groups.
- Return type:
SeriesGroupBy
See also
resampleConvenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
Examples
>>> ser = pd.Series([390., 350., 30., 20.], ... index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], name="Max Speed") >>> ser Falcon 390.0 Falcon 350.0 Parrot 30.0 Parrot 20.0 Name: Max Speed, dtype: float64 >>> ser.groupby(["a", "b", "a", "b"]).mean() a 210.0 b 185.0 Name: Max Speed, dtype: float64 >>> ser.groupby(level=0).mean() Falcon 370.0 Parrot 25.0 Name: Max Speed, dtype: float64 >>> ser.groupby(ser > 100).mean() Max Speed False 25.0 True 370.0 Name: Max Speed, dtype: float64
Grouping by Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) >>> ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed") >>> ser Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 Name: Max Speed, dtype: float64 >>> ser.groupby(level=0).mean() Animal Falcon 370.0 Parrot 25.0 Name: Max Speed, dtype: float64 >>> ser.groupby(level="Type").mean() Type Captive 210.0 Wild 185.0 Name: Max Speed, dtype: float64
We can also choose to include NA in group keys or not by setting the dropna parameter; the default setting is True.
>>> ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan]) >>> ser.groupby(level=0).sum() a 3 b 3 dtype: int64
>>> ser.groupby(level=0, dropna=False).sum() a 3 b 3 NaN 3 dtype: int64
>>> arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot'] >>> ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed") >>> ser.groupby(["a", "b", "a", np.nan]).mean() a 210.0 b 350.0 Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", np.nan], dropna=False).mean() a 210.0 b 350.0 NaN 20.0 Name: Max Speed, dtype: float64
- count()[source]
Return number of non-NA/null observations in the Series.
See also
DataFrame.countCount non-NA cells for each column or row.
Examples
>>> s = pd.Series([0.0, 1.0, np.nan]) >>> s.count() 2
- mode(dropna=True)[source]
Return the mode(s) of the Series.
The mode is the value that appears most often. There can be multiple modes.
Always returns Series even if only one value is returned.
- Parameters:
dropna (bool, default True) – Don’t consider counts of NaN/NaT.
- Returns:
Modes of the Series in sorted order.
- Return type:
Series
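Examples
A short sketch showing that multiple modes are returned in sorted order:
>>> import pandas as pd
>>> s = pd.Series([2, 4, 2, 2, 4, 4])
>>> s.mode()
0    2
1    4
dtype: int64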
- unique()[source]
Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
- Returns:
The unique values returned as a NumPy array. See Notes.
- Return type:
ndarray or ExtensionArray
See also
Series.drop_duplicatesReturn Series with duplicate values removed.
uniqueTop-level unique method for any 1-d array-like object.
Index.uniqueReturn Index with unique values from an Index object.
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new ExtensionArray of that type with just the unique values is returned. This includes
Categorical
Period
Datetime with Timezone
Datetime without Timezone
Timedelta
Interval
Sparse
IntegerNA
See Examples section.
Examples
>>> pd.Series([2, 1, 3, 3], name='A').unique() array([2, 1, 3])
>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique() <DatetimeArray> ['2016-01-01 00:00:00'] Length: 1, dtype: datetime64[ns]
>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern') ... for _ in range(3)]).unique() <DatetimeArray> ['2016-01-01 00:00:00-05:00'] Length: 1, dtype: datetime64[ns, US/Eastern]
A Categorical will return categories in the order of appearance and with the same dtype.
>>> pd.Series(pd.Categorical(list('baabc'))).unique() ['b', 'a', 'c'] Categories (3, object): ['a', 'b', 'c'] >>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'), ... ordered=True)).unique() ['b', 'a', 'c'] Categories (3, object): ['a' < 'b' < 'c']
- drop_duplicates(*, keep: Literal['first', 'last', False] = 'first', inplace: Literal[False] = False, ignore_index: bool = False) Series[source]
- drop_duplicates(*, keep: Literal['first', 'last', False] = 'first', inplace: Literal[True], ignore_index: bool = False) None
- drop_duplicates(*, keep: Literal['first', 'last', False] = 'first', inplace: bool = False, ignore_index: bool = False) Series | None
Return Series with duplicate values removed.
- Parameters:
keep ({‘first’, ‘last’, False}, default ‘first’) –
‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
inplace (bool, default False) – If True, performs operation inplace and returns None.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 2.0.0.
- Returns:
Series with duplicates dropped or None if inplace=True.
- Return type:
Series or None
See also
Index.drop_duplicatesEquivalent method on Index.
DataFrame.drop_duplicatesEquivalent method on DataFrame.
Series.duplicatedRelated method on Series, indicating duplicate Series values.
Series.uniqueReturn unique values as an array.
Examples
Generate a Series with duplicated entries.
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], ... name='animal') >>> s 0 lama 1 cow 2 lama 3 beetle 4 lama 5 hippo Name: animal, dtype: object
With the ‘keep’ parameter, the selection behaviour of duplicated values can be changed. The value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’.
>>> s.drop_duplicates() 0 lama 1 cow 3 beetle 5 hippo Name: animal, dtype: object
The value ‘last’ for parameter ‘keep’ keeps the last occurrence for each set of duplicated entries.
>>> s.drop_duplicates(keep='last') 1 cow 3 beetle 4 lama 5 hippo Name: animal, dtype: object
The value False for parameter ‘keep’ discards all sets of duplicated entries.
>>> s.drop_duplicates(keep=False) 1 cow 3 beetle 5 hippo Name: animal, dtype: object
- duplicated(keep='first')[source]
Indicate duplicate Series values.
Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.
- Parameters:
keep ({'first', 'last', False}, default 'first') –
Method to handle dropping duplicates:
‘first’ : Mark duplicates as True except for the first occurrence.
‘last’ : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Returns:
Series indicating whether each value has occurred in the preceding values.
- Return type:
Series[bool]
See also
Index.duplicatedEquivalent method on pandas.Index.
DataFrame.duplicatedEquivalent method on pandas.DataFrame.
Series.drop_duplicatesRemove duplicate values from Series.
Examples
By default, for each set of duplicated values, the first occurrence is set to False and all others to True:
>>> animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama']) >>> animals.duplicated() 0 False 1 False 2 True 3 False 4 True dtype: bool
which is equivalent to
>>> animals.duplicated(keep='first') 0 False 1 False 2 True 3 False 4 True dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:
>>> animals.duplicated(keep='last') 0 True 1 False 2 True 3 False 4 False dtype: bool
By setting keep to False, all duplicates are True:
>>> animals.duplicated(keep=False) 0 True 1 False 2 True 3 False 4 True dtype: bool
- idxmin(axis=0, skipna=True, *args, **kwargs)[source]
Return the row label of the minimum value.
If multiple values equal the minimum, the first row label with that value is returned.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
skipna (bool, default True) – Exclude NA/null values. If the entire Series is NA, the result will be NA.
*args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Label of the minimum value.
- Return type:
Index
- Raises:
ValueError – If the Series is empty.
See also
numpy.argminReturn indices of the minimum values along the given axis.
DataFrame.idxminReturn index of first occurrence of minimum over requested axis.
Series.idxmaxReturn index label of the first occurrence of maximum of values.
Notes
This method is the Series version of ndarray.argmin. This method returns the label of the minimum, while ndarray.argmin returns the position. To get the position, use series.values.argmin().
Examples
>>> s = pd.Series(data=[1, None, 4, 1], ... index=['A', 'B', 'C', 'D']) >>> s A 1.0 B NaN C 4.0 D 1.0 dtype: float64
>>> s.idxmin() 'A'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmin(skipna=False) nan
- idxmax(axis=0, skipna=True, *args, **kwargs)[source]
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that value is returned.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
skipna (bool, default True) – Exclude NA/null values. If the entire Series is NA, the result will be NA.
*args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Label of the maximum value.
- Return type:
Index
- Raises:
ValueError – If the Series is empty.
See also
numpy.argmaxReturn indices of the maximum values along the given axis.
DataFrame.idxmaxReturn index of first occurrence of maximum over requested axis.
Series.idxminReturn index label of the first occurrence of minimum of values.
Notes
This method is the Series version of ndarray.argmax. This method returns the label of the maximum, while ndarray.argmax returns the position. To get the position, use series.values.argmax().
Examples
>>> s = pd.Series(data=[1, None, 4, 3, 4], ... index=['A', 'B', 'C', 'D', 'E']) >>> s A 1.0 B NaN C 4.0 D 3.0 E 4.0 dtype: float64
>>> s.idxmax() 'C'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmax(skipna=False) nan
- round(decimals=0, *args, **kwargs)[source]
Round each value in a Series to the given number of decimals.
- Parameters:
decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.
*args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Rounded values of the Series.
- Return type:
Series
See also
numpy.aroundRound values of an np.array.
DataFrame.roundRound values of a DataFrame.
Examples
>>> s = pd.Series([0.1, 1.3, 2.7]) >>> s.round() 0 0.0 1 1.0 2 3.0 dtype: float64
- quantile(q: float = 0.5, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') float[source]
- quantile(q: Sequence[float] | ExtensionArray | ndarray | Index | Series, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') Series
- quantile(q: float | Sequence[float] | ExtensionArray | ndarray | Index | Series = 0.5, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') float | Series
Return value at the given quantile.
- Parameters:
q (float or array-like, default 0.5 (50% quantile)) – The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
- Returns:
If q is an array, a Series will be returned where the index is q and the values are the quantiles, otherwise a float will be returned.
- Return type:
float or Series
See also
core.window.Rolling.quantileCalculate the rolling quantile.
numpy.percentileReturns the q-th percentile(s) of the array elements.
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s.quantile(.5) 2.5 >>> s.quantile([.25, .5, .75]) 0.25 1.75 0.50 2.50 0.75 3.25 dtype: float64
- corr(other, method='pearson', min_periods=None)[source]
Compute correlation with other Series, excluding missing values.
The two Series objects are not required to be the same length and will be aligned internally before the correlation function is applied.
- Parameters:
other (Series) – Series with which to compute the correlation.
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method used to compute correlation:
pearson : Standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: Callable with input two 1d ndarrays and returning a float.
Warning
Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
min_periods (int, optional) – Minimum number of observations needed to have a valid result.
- Returns:
Correlation with other.
- Return type:
float
See also
DataFrame.corrCompute pairwise correlation between columns.
DataFrame.corrwithCompute pairwise correlation with another DataFrame or Series.
Notes
Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
Examples
>>> def histogram_intersection(a, b): ... v = np.minimum(a, b).sum().round(decimals=1) ... return v >>> s1 = pd.Series([.2, .0, .6, .2]) >>> s2 = pd.Series([.3, .6, .0, .1]) >>> s1.corr(s2, method=histogram_intersection) 0.3
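For comparison, the default Pearson method on two perfectly linearly related Series, and the effect of min_periods (a sketch; floating-point rounding could perturb the last digits):
>>> s1 = pd.Series([1, 2, 3, 4])
>>> s2 = pd.Series([2, 4, 6, 8])
>>> s1.corr(s2)
1.0
>>> s1.corr(s2, min_periods=5)
nan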
- cov(other, min_periods=None, ddof=1)[source]
Compute covariance with Series, excluding missing values.
The two Series objects are not required to be the same length and will be aligned internally before the covariance is calculated.
- Parameters:
other (Series) – Series with which to compute the covariance.
min_periods (int, optional) – Minimum number of observations needed to have a valid result.
ddof (int, default 1) –
Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
New in version 1.1.0.
- Returns:
Covariance between Series and other normalized by N-1 (unbiased estimator).
- Return type:
float
See also
DataFrame.covCompute pairwise covariance of columns.
Examples
>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035]) >>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198]) >>> s1.cov(s2) -0.01685762652715874
- diff(periods=1)[source]
First discrete difference of element.
Calculates the difference of a Series element compared with another element in the Series (default is element in previous row).
- Parameters:
periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.
- Returns:
First differences of the Series.
- Return type:
Series
See also
Series.pct_changePercent change over given number of periods.
Series.shiftShift index by desired number of periods with an optional time freq.
DataFrame.diffFirst discrete difference of object.
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to the current dtype in the Series; however, the dtype of the result is always float64.
Examples
Difference with previous row
>>> s = pd.Series([1, 1, 2, 3, 5, 8]) >>> s.diff() 0 NaN 1 0.0 2 1.0 3 1.0 4 2.0 5 3.0 dtype: float64
Difference with 3rd previous row
>>> s.diff(periods=3) 0 NaN 1 NaN 2 NaN 3 2.0 4 4.0 5 6.0 dtype: float64
Difference with following row
>>> s.diff(periods=-1) 0 0.0 1 -1.0 2 -1.0 3 -2.0 4 -3.0 5 NaN dtype: float64
Overflow in input dtype
>>> s = pd.Series([1, 0], dtype=np.uint8) >>> s.diff() 0 NaN 1 255.0 dtype: float64
- autocorr(lag=1)[source]
Compute the lag-N autocorrelation.
This method computes the Pearson correlation between the Series and its shifted self.
- Parameters:
lag (int, default 1) – Number of lags to apply before performing autocorrelation.
- Returns:
The Pearson correlation between self and self.shift(lag).
- Return type:
float
See also
Series.corrCompute the correlation between two Series.
Series.shiftShift index by desired number of periods.
DataFrame.corrCompute pairwise correlation of columns.
DataFrame.corrwithCompute pairwise correlation between rows or columns of two DataFrame objects.
Notes
If the Pearson correlation is not well defined, ‘NaN’ is returned.
Examples
>>> s = pd.Series([0.25, 0.5, 0.2, -0.05]) >>> s.autocorr() 0.10355... >>> s.autocorr(lag=2) -0.99999...
If the Pearson correlation is not well defined, then ‘NaN’ is returned.
>>> s = pd.Series([1, 0, 0, 0]) >>> s.autocorr() nan
- dot(other)[source]
Compute the dot product between the Series and the columns of other.
This method computes the dot product between the Series and another one, or the Series and each column of a DataFrame, or the Series and each column of an array.
It can also be called using self @ other in Python >= 3.5.
- Parameters:
other (Series, DataFrame or array-like) – The other object to compute the dot product with its columns.
- Returns:
The dot product of the Series and other if other is a Series; a Series of the dot products between the Series and each column of other if other is a DataFrame; or a numpy.ndarray of the dot products between the Series and each column if other is a numpy array.
- Return type:
scalar, Series or numpy.ndarray
See also
DataFrame.dotCompute the matrix product with the DataFrame.
Series.mulMultiplication of series and other, element-wise.
Notes
The Series and other have to share the same index if other is a Series or a DataFrame.
Examples
>>> s = pd.Series([0, 1, 2, 3]) >>> other = pd.Series([-1, 2, -3, 4]) >>> s.dot(other) 8 >>> s @ other 8 >>> df = pd.DataFrame([[0, 1], [-2, 3], [4, -5], [6, 7]]) >>> s.dot(df) 0 24 1 14 dtype: int64 >>> arr = np.array([[0, 1], [-2, 3], [4, -5], [6, 7]]) >>> s.dot(arr) array([24, 14])
- searchsorted(value, side='left', sorter=None)[source]
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted Series self such that, if the corresponding elements in value were inserted before the indices, the order of self would be preserved.
Note
The Series must be monotonically sorted, otherwise wrong locations will likely be returned. Pandas does not check this for you.
- Parameters:
value (array-like or scalar) – Values to insert into self.
side ({'left', 'right'}, optional) – If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of self).
sorter (1-D array-like, optional) – Optional array of integer indices that sort self into ascending order. They are typically the result of
np.argsort.
- Returns:
A scalar or array of insertion points with the same shape as value.
- Return type:
int or array of int
See also
sort_valuesSort by the values along either axis.
numpy.searchsortedSimilar method from NumPy.
Notes
Binary search is used to find the required insertion points.
Examples
>>> ser = pd.Series([1, 2, 3]) >>> ser 0 1 1 2 2 3 dtype: int64
>>> ser.searchsorted(4) 3
>>> ser.searchsorted([0, 4]) array([0, 3])
>>> ser.searchsorted([1, 3], side='left') array([0, 2])
>>> ser.searchsorted([1, 3], side='right') array([1, 3])
>>> ser = pd.Series(pd.to_datetime(['3/11/2000', '3/12/2000', '3/13/2000'])) >>> ser 0 2000-03-11 1 2000-03-12 2 2000-03-13 dtype: datetime64[ns]
>>> ser.searchsorted('3/14/2000') 3
>>> ser = pd.Categorical( ... ['apple', 'bread', 'bread', 'cheese', 'milk'], ordered=True ... ) >>> ser ['apple', 'bread', 'bread', 'cheese', 'milk'] Categories (4, object): ['apple' < 'bread' < 'cheese' < 'milk']
>>> ser.searchsorted('bread') 1
>>> ser.searchsorted(['bread'], side='right') array([3])
If the values are not monotonically sorted, wrong locations may be returned:
>>> ser = pd.Series([2, 1, 3]) >>> ser 0 2 1 1 2 3 dtype: int64
>>> ser.searchsorted(1) 0 # wrong result, correct would be 1
- compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))[source]
Compare to another Series and show the differences.
New in version 1.1.0.
- Parameters:
other (Series) – Object to compare with.
align_axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine which axis to align the comparison on.
- 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.
- 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.
keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
result_names (tuple, default ('self', 'other')) –
Set the DataFrames’ names in the comparison.
New in version 1.5.0.
- Returns:
If axis is 0 or ‘index’ the result will be a Series. The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
If axis is 1 or ‘columns’ the result will be a DataFrame. It will have two columns namely ‘self’ and ‘other’.
- Return type:
Series or DataFrame
See also
DataFrame.compareCompare with another DataFrame and show differences.
Notes
Matching NaNs will not appear as a difference.
Examples
>>> s1 = pd.Series(["a", "b", "c", "d", "e"]) >>> s2 = pd.Series(["a", "a", "c", "b", "e"])
Align the differences on columns
>>> s1.compare(s2) self other 1 b a 3 d b
Stack the differences on indices
>>> s1.compare(s2, align_axis=0) 1 self b other a 3 self d other b dtype: object
Keep all original rows
>>> s1.compare(s2, keep_shape=True) self other 0 NaN NaN 1 b a 2 NaN NaN 3 d b 4 NaN NaN
Keep all original rows and also all original values
>>> s1.compare(s2, keep_shape=True, keep_equal=True) self other 0 a a 1 b a 2 c c 3 d b 4 e e
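As the Notes state, matching NaNs are not reported as differences; a quick sketch:
>>> s1 = pd.Series([1.0, np.nan, 3.0])
>>> s2 = pd.Series([1.0, np.nan, 4.0])
>>> s1.compare(s2)
   self  other
2   3.0    4.0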
- combine(other, func, fill_value=None)[source]
Combine the Series with a Series or scalar according to func.
Combine the Series and other using func to perform elementwise selection for the combined Series. fill_value is used when a value is missing at some index in one of the two objects being combined.
- Parameters:
other (Series or scalar) – The value(s) to be combined with the Series.
func (function) – Function that takes two scalars as inputs and returns an element.
fill_value (scalar, optional) – The value to assume when an index is missing from one Series or the other. The default specifies to use the appropriate NaN value for the underlying dtype of the Series.
- Returns:
The result of combining the Series with the other object.
- Return type:
Series
See also
Series.combine_firstCombine Series values, choosing the calling Series’ values first.
Examples
Consider 2 Datasets s1 and s2 containing highest clocked speeds of different birds.
>>> s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0}) >>> s1 falcon 330.0 eagle 160.0 dtype: float64 >>> s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0}) >>> s2 falcon 345.0 eagle 200.0 duck 30.0 dtype: float64
Now, to combine the two datasets and view the highest speeds of the birds across the two datasets:
>>> s1.combine(s2, max) duck NaN eagle 200.0 falcon 345.0 dtype: float64
In the previous example, the resulting value for duck is missing, because the maximum of a NaN and a float is a NaN. So, in the example, we set fill_value=0, so the maximum value returned will be the value from some dataset.
>>> s1.combine(s2, max, fill_value=0) duck 30.0 eagle 200.0 falcon 345.0 dtype: float64
- combine_first(other)[source]
Update null elements with value in the same location in ‘other’.
Combine two Series objects by filling null values in one Series with non-null values from the other Series. Result index will be the union of the two indexes.
- Parameters:
other (Series) – The value(s) to be used for filling null values.
- Returns:
The result of combining the provided Series with the other object.
- Return type:
Series
See also
Series.combinePerform element-wise operation on two Series using a given function.
Examples
>>> s1 = pd.Series([1, np.nan]) >>> s2 = pd.Series([3, 4, 5]) >>> s1.combine_first(s2) 0 1.0 1 4.0 2 5.0 dtype: float64
Null values still persist if the location of that null value does not exist in other:
>>> s1 = pd.Series({'falcon': np.nan, 'eagle': 160.0}) >>> s2 = pd.Series({'eagle': 200.0, 'duck': 30.0}) >>> s1.combine_first(s2) duck 30.0 eagle 160.0 falcon NaN dtype: float64
- update(other)[source]
Modify Series in place using values from passed Series.
Uses non-NA values from passed Series to make updates. Aligns on index.
- Parameters:
other (Series, or object coercible into Series) –
- Return type:
None
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.update(pd.Series([4, 5, 6])) >>> s 0 4 1 5 2 6 dtype: int64
>>> s = pd.Series(['a', 'b', 'c']) >>> s.update(pd.Series(['d', 'e'], index=[0, 2])) >>> s 0 d 1 b 2 e dtype: object
>>> s = pd.Series([1, 2, 3]) >>> s.update(pd.Series([4, 5, 6, 7, 8])) >>> s 0 4 1 5 2 6 dtype: int64
If other contains NaNs the corresponding values are not updated in the original Series.
>>> s = pd.Series([1, 2, 3]) >>> s.update(pd.Series([4, np.nan, 6])) >>> s 0 4 1 2 2 6 dtype: int64
other can also be a non-Series object type that is coercible into a Series:
>>> s = pd.Series([1, 2, 3]) >>> s.update([4, np.nan, 6]) >>> s 0 4 1 2 2 6 dtype: int64
>>> s = pd.Series([1, 2, 3]) >>> s.update({1: 9}) >>> s 0 1 1 9 2 3 dtype: int64
- sort_values(*, axis: int | Literal['index', 'columns', 'rows'] = 0, ascending: bool | int | Sequence[bool] | Sequence[int] = True, inplace: Literal[False] = False, kind: str = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: Callable[[Series], Series | ExtensionArray | ndarray | Index] | None = None) Series[source]
- sort_values(*, axis: int | Literal['index', 'columns', 'rows'] = 0, ascending: bool | int | Sequence[bool] | Sequence[int] = True, inplace: Literal[True], kind: str = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: Callable[[Series], Series | ExtensionArray | ndarray | Index] | None = None) None
Sort by the values.
Sort a Series in ascending or descending order by some criterion.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
ascending (bool or list of bools, default True) – If True, sort values in ascending order, otherwise descending.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms.
na_position ({'first' or 'last'}, default 'last') – Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) –
If not None, apply the key function to the series values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return an array-like.
New in version 1.1.0.
- Returns:
Series ordered by values or None if inplace=True.
- Return type:
Series or None
See also
Series.sort_indexSort by the Series indices.
DataFrame.sort_valuesSort DataFrame by the values along either axis.
DataFrame.sort_indexSort DataFrame by indices.
Examples
>>> s = pd.Series([np.nan, 1, 3, 10, 5]) >>> s 0 NaN 1 1.0 2 3.0 3 10.0 4 5.0 dtype: float64
Sort values in ascending order (default behaviour)
>>> s.sort_values(ascending=True) 1 1.0 2 3.0 4 5.0 3 10.0 0 NaN dtype: float64
Sort values in descending order
>>> s.sort_values(ascending=False) 3 10.0 4 5.0 2 3.0 1 1.0 0 NaN dtype: float64
Sort values putting NAs first
>>> s.sort_values(na_position='first') 0 NaN 1 1.0 2 3.0 4 5.0 3 10.0 dtype: float64
Sort a series of strings
>>> s = pd.Series(['z', 'b', 'd', 'a', 'c']) >>> s 0 z 1 b 2 d 3 a 4 c dtype: object
>>> s.sort_values() 3 a 1 b 4 c 2 d 0 z dtype: object
Sort using a key function. Your key function will be given the Series of values and should return an array-like.
>>> s = pd.Series(['a', 'B', 'c', 'D', 'e']) >>> s.sort_values() 1 B 3 D 0 a 2 c 4 e dtype: object >>> s.sort_values(key=lambda x: x.str.lower()) 0 a 1 B 2 c 3 D 4 e dtype: object
NumPy ufuncs work well here. For example, we can sort by the sin of the value:
>>> s = pd.Series([-4, -2, 0, 2, 4]) >>> s.sort_values(key=np.sin) 1 -2 4 4 2 0 0 -4 3 2 dtype: int64
More complicated user-defined functions can be used, as long as they expect a Series and return an array-like
>>> s.sort_values(key=lambda x: (np.tan(x.cumsum()))) 0 -4 3 2 4 4 1 -2 2 0 dtype: int64
- sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: Literal[True], kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) None[source]
- sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: Literal[False] = False, kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) Series
- sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: bool = False, kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) Series | None
Sort Series by index labels.
Returns a new Series sorted by label if the inplace argument is False, otherwise updates the original series and returns None.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
level (int, optional) – If not None, sort on values in specified index level(s).
ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
na_position ({'first', 'last'}, default 'last') – If ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) –
If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape.
New in version 1.1.0.
- Returns:
The original Series sorted by the labels or None if inplace=True.
- Return type:
Series or None
See also
DataFrame.sort_indexSort DataFrame by the index.
DataFrame.sort_valuesSort DataFrame by the value.
Series.sort_valuesSort Series by the value.
Examples
>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4]) >>> s.sort_index() 1 c 2 b 3 a 4 d dtype: object
Sort Descending
>>> s.sort_index(ascending=False) 4 d 3 a 2 b 1 c dtype: object
By default NaNs are put at the end, but use na_position to place them at the beginning
>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, np.nan]) >>> s.sort_index(na_position='first') NaN d 1.0 c 2.0 b 3.0 a dtype: object
Specify index level to sort
>>> arrays = [np.array(['qux', 'qux', 'foo', 'foo', ... 'baz', 'baz', 'bar', 'bar']), ... np.array(['two', 'one', 'two', 'one', ... 'two', 'one', 'two', 'one'])] >>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=arrays) >>> s.sort_index(level=1) bar one 8 baz one 6 foo one 4 qux one 2 bar two 7 baz two 5 foo two 3 qux two 1 dtype: int64
Does not sort by remaining levels when sorting by levels
>>> s.sort_index(level=1, sort_remaining=False) qux one 2 foo one 4 baz one 6 bar one 8 qux two 1 foo two 3 baz two 5 bar two 7 dtype: int64
Apply a key function before sorting
>>> s = pd.Series([1, 2, 3, 4], index=['A', 'b', 'C', 'd']) >>> s.sort_index(key=lambda x : x.str.lower()) A 1 b 2 C 3 d 4 dtype: int64
- argsort(axis=0, kind='quicksort', order=None)[source]
Return the integer indices that would sort the Series values.
Overrides ndarray.argsort. Argsorts the values, omitting NA/null values, and places the result in the same locations as the non-NA values.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
kind ({'mergesort', 'quicksort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms.
order (None) – Has no effect but is accepted for compatibility with numpy.
- Returns:
Positions of values within the sort order with -1 indicating nan values.
- Return type:
Series[np.intp]
See also
numpy.ndarray.argsortReturns the indices that would sort this array.
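Examples
No example survives above; a minimal sketch (the integer dtype of the result is platform-dependent):
>>> import pandas as pd
>>> s = pd.Series([3, 1, 2])
>>> s.argsort()
0    1
1    2
2    0
dtype: int64
Reading the result: position 1 holds the smallest value, then position 2, then position 0.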
- nlargest(n=5, keep='first')[source]
Return the largest n elements.
- Parameters:
n (int, default 5) – Return this many descending sorted values.
keep ({'first', 'last', 'all'}, default 'first') –
When there are duplicate values that cannot all fit in a Series of n elements:
first : return the first n occurrences in order of appearance.
last : return the last n occurrences in reverse order of appearance.
all : keep all occurrences. This can result in a Series of size larger than n.
- Returns:
The n largest values in the Series, sorted in decreasing order.
- Return type:
Series
See also
Series.nsmallestGet the n smallest elements.
Series.sort_valuesSort Series by values.
Series.headReturn the first n rows.
Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
Examples
>>> countries_population = {"Italy": 59000000, "France": 65000000, ... "Malta": 434000, "Maldives": 434000, ... "Brunei": 434000, "Iceland": 337000, ... "Nauru": 11300, "Tuvalu": 11300, ... "Anguilla": 11300, "Montserrat": 5200} >>> s = pd.Series(countries_population) >>> s Italy 59000000 France 65000000 Malta 434000 Maldives 434000 Brunei 434000 Iceland 337000 Nauru 11300 Tuvalu 11300 Anguilla 11300 Montserrat 5200 dtype: int64
The n largest elements where n=5 by default.
>>> s.nlargest() France 65000000 Italy 59000000 Malta 434000 Maldives 434000 Brunei 434000 dtype: int64
The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.
>>> s.nlargest(3) France 65000000 Italy 59000000 Malta 434000 dtype: int64
The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the last with value 434000 based on the index order.
>>> s.nlargest(3, keep='last') France 65000000 Italy 59000000 Brunei 434000 dtype: int64
The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five elements due to the three duplicates.
>>> s.nlargest(3, keep='all') France 65000000 Italy 59000000 Malta 434000 Maldives 434000 Brunei 434000 dtype: int64
- nsmallest(n=5, keep='first')[source]
Return the smallest n elements.
- Parameters:
n (int, default 5) – Return this many ascending sorted values.
keep ({'first', 'last', 'all'}, default 'first') –
When there are duplicate values that cannot all fit in a Series of n elements:
first : return the first n occurrences in order of appearance.
last : return the last n occurrences in reverse order of appearance.
all : keep all occurrences. This can result in a Series of size larger than n.
- Returns:
The n smallest values in the Series, sorted in increasing order.
- Return type:
Series
See also
Series.nlargestGet the n largest elements.
Series.sort_valuesSort Series by values.
Series.headReturn the first n rows.
Notes
Faster than .sort_values().head(n) for small n relative to the size of the Series object.
Examples
>>> countries_population = {"Italy": 59000000, "France": 65000000, ... "Brunei": 434000, "Malta": 434000, ... "Maldives": 434000, "Iceland": 337000, ... "Nauru": 11300, "Tuvalu": 11300, ... "Anguilla": 11300, "Montserrat": 5200} >>> s = pd.Series(countries_population) >>> s Italy 59000000 France 65000000 Brunei 434000 Malta 434000 Maldives 434000 Iceland 337000 Nauru 11300 Tuvalu 11300 Anguilla 11300 Montserrat 5200 dtype: int64
The n smallest elements where n=5 by default.
>>> s.nsmallest() Montserrat 5200 Nauru 11300 Tuvalu 11300 Anguilla 11300 Iceland 337000 dtype: int64
The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.
>>> s.nsmallest(3) Montserrat 5200 Nauru 11300 Tuvalu 11300 dtype: int64
The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.
>>> s.nsmallest(3, keep='last') Montserrat 5200 Anguilla 11300 Tuvalu 11300 dtype: int64
The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four elements due to the three duplicates.
>>> s.nsmallest(3, keep='all') Montserrat 5200 Nauru 11300 Tuvalu 11300 Anguilla 11300 dtype: int64
- swaplevel(i=-2, j=-1, copy=None)[source]
Swap levels i and j in a MultiIndex.
Default is to swap the two innermost levels of the index.
- Parameters:
i, j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
copy (bool, optional) – Whether to copy underlying data.
- Returns:
Series with levels swapped in MultiIndex.
- Return type:
Series
Examples
>>> s = pd.Series( ... ["A", "B", "A", "C"], ... index=[ ... ["Final exam", "Final exam", "Coursework", "Coursework"], ... ["History", "Geography", "History", "Geography"], ... ["January", "February", "March", "April"], ... ], ... ) >>> s Final exam History January A Geography February B Coursework History March A Geography April C dtype: object
In the following example, we will swap the levels of the indices. Here, we will swap the levels column-wise, but levels can be swapped row-wise in a similar manner. Note that column-wise is the default behaviour. By not supplying any arguments for i and j, we swap the last and second to last indices.
>>> s.swaplevel() Final exam January History A February Geography B Coursework March History A April Geography C dtype: object
By supplying one argument, we can choose which index to swap the last index with. We can for example swap the first index with the last one as follows.
>>> s.swaplevel(0) January History Final exam A February Geography Final exam B March History Coursework A April Geography Coursework C dtype: object
We can also define explicitly which indices we want to swap by supplying values for both i and j. Here, we for example swap the first and second indices.
>>> s.swaplevel(0, 1) History Final exam January A Geography Final exam February B History Coursework March A Geography Coursework April C dtype: object
- reorder_levels(order)[source]
Rearrange index levels using input order.
May not drop or duplicate levels.
- Parameters:
order (list of int representing new level order) – Reference level by number or key.
- Returns:
Series with index levels rearranged.
- Return type:
Series
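Examples
A brief sketch of reordering a two-level MultiIndex by position (the level names here are illustrative):
>>> import pandas as pd
>>> arrays = [['one', 'one', 'two'], ['a', 'b', 'a']]
>>> s = pd.Series([1, 2, 3],
...               index=pd.MultiIndex.from_arrays(arrays, names=['L0', 'L1']))
>>> s.reorder_levels([1, 0])
L1  L0
a   one    1
b   one    2
a   two    3
dtype: int64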
- explode(ignore_index=False)[source]
Transform each element of a list-like to a row.
- Parameters:
ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
- Returns:
Exploded lists to rows; index will be duplicated for these rows.
- Return type:
Series
See also
Series.str.splitSplit string values on specified separator.
Series.unstackUnstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.
DataFrame.meltUnpivot a DataFrame from wide format to long format.
DataFrame.explodeExplode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.
Reference the user guide for more examples.
Examples
>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]]) >>> s 0 [1, 2, 3] 1 foo 2 [] 3 [3, 4] dtype: object
>>> s.explode() 0 1 0 2 0 3 1 foo 2 NaN 3 3 3 4 dtype: object
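As described under the ignore_index parameter, passing ignore_index=True relabels the result 0 through n - 1 instead of repeating the original index:
>>> s.explode(ignore_index=True)
0      1
1      2
2      3
3    foo
4    NaN
5      3
6      4
dtype: object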
- unstack(level=-1, fill_value=None)[source]
Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
- Parameters:
level (int, str, or list of these, default last level) – Level(s) to unstack, can pass level name.
fill_value (scalar value, default None) – Value to use when replacing NaN values.
- Returns:
Unstacked Series.
- Return type:
DataFrame
Notes
Reference the user guide for more examples.
Examples
>>> s = pd.Series([1, 2, 3, 4], ... index=pd.MultiIndex.from_product([['one', 'two'], ... ['a', 'b']])) >>> s one a 1 b 2 two a 3 b 4 dtype: int64
>>> s.unstack(level=-1) a b one 1 2 two 3 4
>>> s.unstack(level=0) one two a 1 3 b 2 4
- map(arg, na_action=None)[source]
Map values of Series according to an input mapping or function.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
- Parameters:
arg (function, collections.abc.Mapping subclass or Series) – Mapping correspondence.
na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.
- Returns:
Same index as caller.
- Return type:
Series
See also
Series.applyFor applying more complex functions on a Series.
DataFrame.applyApply a function row-/column-wise.
DataFrame.applymapApply a function elementwise on a whole DataFrame.
Notes
When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.
Examples
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit']) >>> s 0 cat 1 dog 2 NaN 3 rabbit dtype: object
map accepts a dict or a Series. Values that are not found in the dict are converted to NaN, unless the dict has a default value (e.g. defaultdict):
>>> s.map({'cat': 'kitten', 'dog': 'puppy'}) 0 kitten 1 puppy 2 NaN 3 NaN dtype: object
It also accepts a function:
>>> s.map('I am a {}'.format) 0 I am a cat 1 I am a dog 2 I am a nan 3 I am a rabbit dtype: object
To avoid applying the function to missing values (and keep them as NaN) na_action='ignore' can be used:
>>> s.map('I am a {}'.format, na_action='ignore') 0 I am a cat 1 I am a dog 2 NaN 3 I am a rabbit dtype: object
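As the Notes explain, a dict subclass defining __missing__, such as collections.defaultdict, supplies its default instead of NaN; a sketch (the NaN key is also passed to the mapping because na_action is None):
>>> from collections import defaultdict
>>> d = defaultdict(lambda: 'unknown', {'cat': 'kitten', 'dog': 'puppy'})
>>> s.map(d)
0     kitten
1      puppy
2    unknown
3    unknown
dtype: object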
- aggregate(func=None, axis=0, *args, **kwargs)[source]
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
- Return type:
scalar, Series or DataFrame
See also
Series.applyInvoke function on a Series.
Series.transformTransform function producing a Series with like indexes.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s 0 1 1 2 2 3 3 4 dtype: int64
>>> s.agg('min') 1
>>> s.agg(['min', 'max']) min 1 max 4 dtype: int64
- agg(func=None, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
- Return type:
scalar, Series or DataFrame
See also
Series.applyInvoke function on a Series.
Series.transformTransform function producing a Series with like indexes.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s 0 1 1 2 2 3 3 4 dtype: int64
>>> s.agg('min') 1
>>> s.agg(['min', 'max']) min 1 max 4 dtype: int64
- any(*, axis: int | Literal['index', 'columns', 'rows'] = 0, bool_only: bool | None = None, skipna: bool = True, level: None = ..., **kwargs) bool
- any(*, axis: int | Literal['index', 'columns', 'rows'] = 0, bool_only: bool | None = None, skipna: bool = True, level: Hashable, **kwargs) Series | bool
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a Series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, a Series is returned; otherwise, a scalar is returned.
- Return type:
scalar or Series
See also
numpy.anyNumpy version of this method.
Series.anyReturn whether any element is True.
Series.allReturn whether all elements are True.
DataFrame.anyReturn whether any element is True over requested axis.
DataFrame.allReturn whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any() False >>> pd.Series([True, False]).any() True >>> pd.Series([], dtype="float64").any() False >>> pd.Series([np.nan]).any() False >>> pd.Series([np.nan]).any(skipna=False) True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) >>> df A B C 0 1 0 0 1 2 2 0
>>> df.any() A True B True C False dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) >>> df A B 0 True 1 1 False 2
>>> df.any(axis='columns') 0 True 1 True dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) >>> df A B 0 True 1 1 False 0
>>> df.any(axis='columns') 0 True 1 False dtype: bool
Aggregating over the entire DataFrame with
axis=None.
>>> df.any(axis=None) True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any() Series([], dtype: bool)
- transform(func, axis=0, *args, **kwargs)[source]
Call
func on self producing a Series with the same axis shape as self.
- Parameters:
func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a Series or when passed to Series.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g.
[np.exp, 'sqrt']
dict-like of axis labels -> functions, function names or list-like of such.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
A Series that must have the same length as self.
- Return type:
- Raises:
ValueError – If the returned Series has a different length than self.
See also
Series.aggOnly perform aggregating type operations.
Series.applyInvoke function on a Series.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)}) >>> df A B 0 0 1 1 1 2 2 2 3 >>> df.transform(lambda x: x + 1) A B 0 1 2 1 2 3 2 3 4
Even though the resulting Series must have the same length as the input Series, it is possible to provide several input functions:
>>> s = pd.Series(range(3)) >>> s 0 0 1 1 2 2 dtype: int64 >>> s.transform([np.sqrt, np.exp]) sqrt exp 0 0.000000 1.000000 1 1.000000 2.718282 2 1.414214 7.389056
You can call transform on a GroupBy object:
>>> df = pd.DataFrame({ ... "Date": [ ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05", ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"], ... "Data": [5, 8, 6, 1, 50, 100, 60, 120], ... }) >>> df Date Data 0 2015-05-08 5 1 2015-05-07 8 2 2015-05-06 6 3 2015-05-05 1 4 2015-05-08 50 5 2015-05-07 100 6 2015-05-06 60 7 2015-05-05 120 >>> df.groupby('Date')['Data'].transform('sum') 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data, dtype: int64
>>> df = pd.DataFrame({ ... "c": [1, 1, 1, 2, 2, 2, 2], ... "type": ["m", "n", "o", "m", "m", "n", "n"] ... }) >>> df c type 0 1 m 1 1 n 2 1 o 3 2 m 4 2 m 5 2 n 6 2 n >>> df['size'] = df.groupby('c')['type'].transform(len) >>> df c type size 0 1 m 3 1 1 n 3 2 1 o 3 3 2 m 4 4 2 m 4 5 2 n 4 6 2 n 4
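The dict-like form listed under Accepted combinations above can also be sketched for a plain Series (illustrative; the dict keys are assumed to become the column labels of the resulting DataFrame):
>>> s = pd.Series([1, 4, 9]) >>> s.transform({'root': np.sqrt, 'double': lambda x: x * 2}) root double 0 1.0 2 1 2.0 8 2 3.0 18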
- apply(func, convert_dtype=True, args=(), **kwargs)[source]
Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
- Parameters:
func (function) – Python function or NumPy ufunc to apply.
convert_dtype (bool, default True) – Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.
args (tuple) – Positional arguments passed to func after the series value.
**kwargs – Additional keyword arguments passed to func.
- Returns:
If func returns a Series object the result will be a DataFrame.
- Return type:
See also
Series.mapFor element-wise operations.
Series.aggOnly perform aggregating type operations.
Series.transformOnly perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
Create a series with typical summer temperatures for each city.
>>> s = pd.Series([20, 21, 12], ... index=['London', 'New York', 'Helsinki']) >>> s London 20 New York 21 Helsinki 12 dtype: int64
Square the values by defining a function and passing it as an argument to
apply().
>>> def square(x): ... return x ** 2 >>> s.apply(square) London 400 New York 441 Helsinki 144 dtype: int64
Square the values by passing an anonymous function as an argument to
apply().
>>> s.apply(lambda x: x ** 2) London 400 New York 441 Helsinki 144 dtype: int64
Define a custom function that needs additional positional arguments and pass these additional arguments using the
args keyword.
>>> def subtract_custom_value(x, custom_value): ... return x - custom_value
>>> s.apply(subtract_custom_value, args=(5,)) London 15 New York 16 Helsinki 7 dtype: int64
Define a custom function that takes keyword arguments and pass these arguments to
apply.
>>> def add_custom_values(x, **kwargs): ... for month in kwargs: ... x += kwargs[month] ... return x
>>> s.apply(add_custom_values, june=30, july=20, august=25) London 95 New York 96 Helsinki 87 dtype: int64
Use a function from the NumPy library.
>>> s.apply(np.log) London 2.995732 New York 3.044522 Helsinki 2.484907 dtype: float64
- align(other, join='outer', axis=None, level=None, copy=None, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)[source]
Align two objects on their axes with the specified join method.
Join method is specified for each axis Index.
- Parameters:
join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –
Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
fill_axis ({0 or 'index'}, default 0) – Filling axis, method and limit.
broadcast_axis ({0 or 'index'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
- Returns:
Aligned objects.
- Return type:
Examples
>>> df = pd.DataFrame( ... [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2] ... ) >>> other = pd.DataFrame( ... [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]], ... columns=["A", "B", "C", "D"], ... index=[2, 3, 4], ... ) >>> df D B E A 1 1 2 3 4 2 6 7 8 9 >>> other A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Align on columns:
>>> left, right = df.align(other, join="outer", axis=1) >>> left A B C D E 1 4 2 NaN 1 3 2 9 7 NaN 6 8 >>> right A B C D E 2 10 20 30 40 NaN 3 60 70 80 90 NaN 4 600 700 800 900 NaN
We can also align on the index:
>>> left, right = df.align(other, join="outer", axis=0) >>> left D B E A 1 1.0 2.0 3.0 4.0 2 6.0 7.0 8.0 9.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN >>> right A B C D 1 NaN NaN NaN NaN 2 10.0 20.0 30.0 40.0 3 60.0 70.0 80.0 90.0 4 600.0 700.0 800.0 900.0
Finally, the default axis=None will align on both index and columns:
>>> left, right = df.align(other, join="outer", axis=None) >>> left A B C D E 1 4.0 2.0 NaN 1.0 3.0 2 9.0 7.0 NaN 6.0 8.0 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN >>> right A B C D E 1 NaN NaN NaN NaN NaN 2 10.0 20.0 30.0 40.0 NaN 3 60.0 70.0 80.0 90.0 NaN 4 600.0 700.0 800.0 900.0 NaN
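Since this entry documents the Series method, a minimal Series-level sketch (illustrative data):
>>> s1 = pd.Series([1, 2], index=['a', 'b']) >>> s2 = pd.Series([3, 4], index=['b', 'c']) >>> left, right = s1.align(s2, join='outer') >>> left a 1.0 b 2.0 c NaN dtype: float64 >>> right a NaN b 3.0 c 4.0 dtype: float64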
- rename(index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | Hashable | None = None, *, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool = True, inplace: Literal[True], level: Hashable | None = None, errors: Literal['ignore', 'raise'] = 'ignore') None[source]
- rename(index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | Hashable | None = None, *, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool = True, inplace: Literal[False] = False, level: Hashable | None = None, errors: Literal['ignore', 'raise'] = 'ignore') Series
- rename(index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | Hashable | None = None, *, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool = True, inplace: bool = False, level: Hashable | None = None, errors: Literal['ignore', 'raise'] = 'ignore') Series | None
Alter Series index labels or name.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
Alternatively, change
Series.name with a scalar value.
See the user guide for more.
- Parameters:
index (scalar, hashable sequence, dict-like or function, optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the Series.name attribute.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
copy (bool, default True) – Also copy underlying data.
inplace (bool, default False) – Whether to return a new Series. If True the value of copy is ignored.
level (int or level name, default None) – In case of MultiIndex, only rename labels in the specified level.
errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise KeyError when a dict-like mapper or index contains labels that are not present in the index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
- Returns:
Series with index labels or name altered or None if
inplace=True.
- Return type:
Series or None
See also
DataFrame.renameCorresponding DataFrame method.
Series.rename_axisSet the name of the axis.
Examples
>>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3 dtype: int64 >>> s.rename("my_name") # scalar, changes Series.name 0 1 1 2 2 3 Name: my_name, dtype: int64 >>> s.rename(lambda x: x ** 2) # function, changes labels 0 1 1 2 4 3 dtype: int64 >>> s.rename({1: 3, 2: 5}) # mapping, changes labels 0 1 3 2 5 3 dtype: int64
- set_axis(labels, *, axis=0, copy=None)[source]
Assign desired index to given axis.
Indexes for row labels can be changed by assigning a list-like or Index.
- Parameters:
- Returns:
An object of type Series.
- Return type:
See also
Series.rename_axisAlter the name of the index.
Examples
>>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3 dtype: int64 >>> s.set_axis(['a', 'b', 'c'], axis=0) a 1 b 2 c 3 dtype: int64
- reindex(index=None, *, axis=None, method=None, copy=None, level=None, fill_value=None, limit=None, tolerance=None)[source]
Conform Series to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and
copy=False.
- Parameters:
index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy (bool, default True) – Return a new object, even if the passed indexes are the same.
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.
tolerance (optional) –
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
- Return type:
Series with changed index.
See also
DataFrame.set_indexSet row labels.
DataFrame.reset_indexRemove row labels or move them to new columns.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
DataFrame.reindex supports two calling conventions:
(index=index_labels, columns=column_labels, ...)
(labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror'] >>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301], ... 'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, ... index=index) >>> df http_status response_time Firefox 200 0.04 Chrome 200 0.02 Safari 404 0.07 IE10 404 0.08 Konqueror 301 1.00
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned
NaN.
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', ... 'Chrome'] >>> df.reindex(new_index) http_status response_time Safari 404.0 0.07 Iceweasel NaN NaN Comodo Dragon NaN NaN IE10 404.0 0.08 Chrome 200.0 0.02
We can fill in the missing values by passing a value to the keyword
fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.
>>> df.reindex(new_index, fill_value=0) http_status response_time Safari 404 0.07 Iceweasel 0 0.00 Comodo Dragon 0 0.00 IE10 404 0.08 Chrome 200 0.02
>>> df.reindex(new_index, fill_value='missing') http_status response_time Safari 404 0.07 Iceweasel missing missing Comodo Dragon missing missing IE10 404 0.08 Chrome 200 0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent']) http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
Or we can use “axis-style” keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns") http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
To further illustrate the filling functionality in
reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).
>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D') >>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, ... index=date_index) >>> df2 prices 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0
Suppose we decide to expand the dataframe to cover a wider date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D') >>> df2.reindex(date_index2) prices 2009-12-29 NaN 2009-12-30 NaN 2009-12-31 NaN 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with
NaN. If desired, we can fill in the missing values using one of several options.
For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.
>>> df2.reindex(date_index2, method='bfill') prices 2009-12-29 100.0 2009-12-30 100.0 2009-12-31 100.0 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
Please note that the
NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.
See the user guide for more.
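The tolerance parameter described above has no example of its own; a minimal sketch (illustrative, assuming a monotonic integer index as required by method='nearest'):
>>> s = pd.Series([1, 2, 3], index=[0, 5, 10]) >>> s.reindex([0, 1, 3], method='nearest', tolerance=1) 0 1.0 1 1.0 3 NaN dtype: float64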
- rename_axis(mapper=_NoDefault.no_default, *, index=_NoDefault.no_default, axis=0, copy=True, inplace=False)[source]
Set the name of the axis for the index or columns.
- Parameters:
mapper (scalar, list-like, optional) – Value to set the axis name attribute.
index (scalar, list-like, dict-like or function, optional) –
Scalar, list-like, dict-like or function transformations to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a Series. This parameter only applies to DataFrame objects.
Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
columns (scalar, list-like, dict-like or function, optional) –
Scalar, list-like, dict-like or function transformations to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a Series. This parameter only applies to DataFrame objects.
Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For Series this parameter is unused and defaults to 0.
copy (bool, default True) – Also copy underlying data.
inplace (bool, default False) – Modifies the object directly, instead of creating a new Series or DataFrame.
self (Series) –
- Returns:
The same type as the caller or None if
inplace=True.
- Return type:
See also
Series.renameAlter Series index labels or name.
DataFrame.renameAlter DataFrame index labels or name.
Index.renameSet new names on index.
Notes
DataFrame.rename_axis supports two calling conventions:
(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)
The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter
copy is ignored.
The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your intent.
Examples
Series
>>> s = pd.Series(["dog", "cat", "monkey"]) >>> s 0 dog 1 cat 2 monkey dtype: object >>> s.rename_axis("animal") animal 0 dog 1 cat 2 monkey dtype: object
DataFrame
>>> df = pd.DataFrame({"num_legs": [4, 4, 2], ... "num_arms": [0, 0, 2]}, ... ["dog", "cat", "monkey"]) >>> df num_legs num_arms dog 4 0 cat 4 0 monkey 2 2 >>> df = df.rename_axis("animal") >>> df num_legs num_arms animal dog 4 0 cat 4 0 monkey 2 2 >>> df = df.rename_axis("limbs", axis="columns") >>> df limbs num_legs num_arms animal dog 4 0 cat 4 0 monkey 2 2
MultiIndex
>>> df.index = pd.MultiIndex.from_product([['mammal'], ... ['dog', 'cat', 'monkey']], ... names=['type', 'name']) >>> df limbs num_legs num_arms type name mammal dog 4 0 cat 4 0 monkey 2 2
>>> df.rename_axis(index={'type': 'class'}) limbs num_legs num_arms class name mammal dog 4 0 cat 4 0 monkey 2 2
>>> df.rename_axis(columns=str.upper) LIMBS num_legs num_arms type name mammal dog 4 0 cat 4 0 monkey 2 2
- drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable | None = None, inplace: Literal[True], errors: Literal['ignore', 'raise'] = 'raise') None[source]
- drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable | None = None, inplace: Literal[False] = False, errors: Literal['ignore', 'raise'] = 'raise') Series
- drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable | None = None, inplace: bool = False, errors: Literal['ignore', 'raise'] = 'raise') Series | None
Return Series with specified index labels removed.
Remove elements of a Series based on specifying the index labels. When using a multi-index, labels on different levels can be removed by specifying the level.
- Parameters:
labels (single label or list-like) – Index labels to drop.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
index (single label or list-like) – Redundant for application on Series, but ‘index’ can be used instead of ‘labels’.
columns (single label or list-like) – No change is made to the Series; use ‘index’ or ‘labels’ instead.
level (int or level name, optional) – For MultiIndex, level for which the labels will be removed.
inplace (bool, default False) – If True, do operation inplace and return None.
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
- Returns:
Series with specified index labels removed or None if
inplace=True.
- Return type:
Series or None
- Raises:
KeyError – If none of the labels are found in the index.
See also
Series.reindexReturn only specified index labels of Series.
Series.dropnaReturn series without null values.
Series.drop_duplicatesReturn Series with duplicate values removed.
DataFrame.dropDrop specified labels from rows or columns.
Examples
>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C']) >>> s A 0 B 1 C 2 dtype: int64
Drop labels B and C
>>> s.drop(labels=['B', 'C']) A 0 dtype: int64
Drop 2nd level label in MultiIndex Series
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'], ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> s = pd.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], ... index=midx) >>> s lama speed 45.0 weight 200.0 length 1.2 cow speed 30.0 weight 250.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3 dtype: float64
>>> s.drop(labels='weight', level=1) lama speed 45.0 length 1.2 cow speed 30.0 length 1.5 falcon speed 320.0 length 0.3 dtype: float64
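A brief sketch of errors='ignore' described above (illustrative): labels missing from the index are skipped instead of raising KeyError.
>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C']) >>> s.drop(labels=['B', 'D'], errors='ignore') A 0 C 2 dtype: int64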
- fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[False] = False, limit: int | None = None, downcast: dict | None = None) Series[source]
- fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[True], limit: int | None = None, downcast: dict | None = None) None
- fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: bool = False, limit: int | None = None, downcast: dict | None = None) Series | None
Fill NA/NaN values using the specified method.
- Parameters:
value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method ({'backfill', 'bfill', 'ffill', None}, default None) –
Method to use for filling holes in reindexed Series:
ffill: propagate last valid observation forward to next valid.
backfill / bfill: use next valid observation to fill gap.
axis ({0 or 'index'}) – Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.
inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
- Returns:
Object with missing values filled or None if
inplace=True.
- Return type:
Series or None
See also
interpolateFill NaN values using interpolation.
reindexConform object to new index.
asfreqConvert TimeSeries to specified frequency.
Examples
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, np.nan], ... [np.nan, 3, np.nan, 4]], ... columns=list("ABCD")) >>> df A B C D 0 NaN 2.0 NaN 0.0 1 3.0 4.0 NaN 1.0 2 NaN NaN NaN NaN 3 NaN 3.0 NaN 4.0
Replace all NaN elements with 0s.
>>> df.fillna(0) A B C D 0 0.0 2.0 0.0 0.0 1 3.0 4.0 0.0 1.0 2 0.0 0.0 0.0 0.0 3 0.0 3.0 0.0 4.0
We can also propagate non-null values forward or backward.
>>> df.fillna(method="ffill") A B C D 0 NaN 2.0 NaN 0.0 1 3.0 4.0 NaN 1.0 2 3.0 4.0 NaN 1.0 3 3.0 3.0 NaN 4.0
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {"A": 0, "B": 1, "C": 2, "D": 3} >>> df.fillna(value=values) A B C D 0 0.0 2.0 2.0 0.0 1 3.0 4.0 2.0 1.0 2 0.0 1.0 2.0 3.0 3 0.0 3.0 2.0 4.0
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1) A B C D 0 0.0 2.0 2.0 0.0 1 3.0 4.0 NaN 1.0 2 NaN 1.0 NaN 3.0 3 NaN 3.0 NaN 4.0
When filling using a DataFrame, replacement happens along the same column names and same indices
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE")) >>> df.fillna(df2) A B C D 0 0.0 2.0 0.0 0.0 1 3.0 4.0 0.0 1.0 2 0.0 0.0 0.0 NaN 3 0.0 3.0 0.0 4.0
Note that column D is not affected since it is not present in df2.
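Since this entry documents the Series method, a minimal Series-level sketch (illustrative data):
>>> ser = pd.Series([np.nan, 2, np.nan, 4]) >>> ser.fillna(0) 0 0.0 1 2.0 2 0.0 3 4.0 dtype: float64 >>> ser.fillna(method='bfill') 0 2.0 1 2.0 2 4.0 3 4.0 dtype: float64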
- pop(item)[source]
Return item and drop it from the series. Raise KeyError if not found.
- Parameters:
item (label) – Index of the element that needs to be removed.
- Return type:
Value that is popped from series.
Examples
>>> ser = pd.Series([1,2,3])
>>> ser.pop(0) 1
>>> ser 1 2 2 3 dtype: int64
- replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[False] = False, limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) Series[source]
- replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[True], limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) None
Replace values given in to_replace with value.
Values of the Series are replaced with other values dynamically.
This differs from updating with
.loc or .iloc, which require you to specify a location to update with some value.
- Parameters:
to_replace (str, regex, list, dict, Series, int, float, or None) –
How to find the values that will be replaced.
numeric, str or regex:
numeric: numeric values equal to to_replace will be replaced with value
str: string exactly matching to_replace will be replaced with value
regex: regexes matching to_replace will be replaced with value
list of str, regex, or numeric:
First, if to_replace and value are both lists, they must be the same length.
Second, if
regex=True then all of the strings in both lists will be interpreted as regexes, otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
str, regex and numeric rules apply as above.
dict:
Dicts can be used to specify different replacement values for different existing values. For example,
{'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional value parameter should not be given.
For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also
None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
inplace (bool, default False) – If True, performs operation inplace and returns None.
limit (int, default None) – Maximum size gap to forward or backward fill.
regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is
True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions, in which case to_replace must be None.
method ({'pad', 'ffill', 'bfill'}) – The method to use for replacement, when to_replace is a scalar, list or tuple and value is
None.
- Returns:
Object after replacement.
- Return type:
- Raises:
AssertionError – If regex is not a bool and to_replace is not None.
TypeError – If to_replace is not a scalar, array-like, dict, or None; if to_replace is a dict and value is not a list, dict, ndarray, or Series; if to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series; when replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.
ValueError – If a list or an ndarray is passed to to_replace and value but they are not the same length.
See also
Series.fillnaFill NA values.
Series.whereReplace values based on boolean condition.
Series.str.replaceSimple string replacement.
Notes
Regex substitution is performed under the hood with
re.sub. The rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
When a dict is used as the to_replace value, the key(s) in the dict are the to_replace part and the value(s) in the dict are the value parameter.
Examples
Scalar `to_replace` and `value`
>>> s = pd.Series([1, 2, 3, 4, 5]) >>> s.replace(1, 5) 0 5 1 2 2 3 3 4 4 5 dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4], ... 'B': [5, 6, 7, 8, 9], ... 'C': ['a', 'b', 'c', 'd', 'e']}) >>> df.replace(0, 5) A B C 0 5 5 a 1 1 6 b 2 2 7 c 3 3 8 d 4 4 9 e
List-like `to_replace`
>>> df.replace([0, 1, 2, 3], 4) A B C 0 4 5 a 1 4 6 b 2 4 7 c 3 4 8 d 4 4 9 e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1]) A B C 0 4 5 a 1 3 6 b 2 2 7 c 3 1 8 d 4 4 9 e
>>> s.replace([1, 2], method='bfill') 0 3 1 3 2 3 3 4 4 5 dtype: int64
dict-like `to_replace`
>>> df.replace({0: 10, 1: 100}) A B C 0 10 5 a 1 100 6 b 2 2 7 c 3 3 8 d 4 4 9 e
>>> df.replace({'A': 0, 'B': 5}, 100) A B C 0 100 100 a 1 1 6 b 2 2 7 c 3 3 8 d 4 4 9 e
>>> df.replace({'A': {0: 100, 4: 400}}) A B C 0 100 5 a 1 1 6 b 2 2 7 c 3 3 8 d 4 400 9 e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'], ... 'B': ['abc', 'bar', 'xyz']}) >>> df.replace(to_replace=r'^ba.$', value='new', regex=True) A B 0 new abc 1 foo new 2 bait xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True) A B 0 new abc 1 foo bar 2 bait xyz
>>> df.replace(regex=r'^ba.$', value='new') A B 0 new abc 1 foo new 2 bait xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'}) A B 0 new abc 1 xyz new 2 bait xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new') A B 0 new abc 1 new new 2 bait xyz
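The same regex machinery applies to a plain Series; a brief illustrative sketch:
>>> s = pd.Series(['bat', 'foo', 'bait']) >>> s.replace(to_replace=r'^ba.$', value='new', regex=True) 0 new 1 foo 2 bait dtype: object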
Compare the behavior of
s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter.
s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):
>>> s.replace({'a': None}) 0 10 1 None 2 None 3 b 4 None dtype: object
When
value is not explicitly passed and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. This is why the ‘a’ values are replaced by 10 in rows 1 and 2, and by ‘b’ in row 4, in this case.
>>> s.replace('a') 0 10 1 10 2 10 3 b 4 b dtype: object
On the other hand, if
None is explicitly passed for value, it will be respected:
>>> s.replace('a', None) 0 10 1 None 2 None 3 b 4 None dtype: object
Changed in version 1.4.0: Previously the explicit
None was silently ignored.
- info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=True)[source]
Print a concise summary of a Series.
This method prints information about a Series including the index dtype, non-null values and memory usage.
New in version 1.4.0.
- Parameters:
verbose (bool, optional) – Whether to print the full summary. By default, the setting in
pandas.options.display.max_info_columns is followed.
buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
memory_usage (bool, str, optional) –
Specifies whether total memory usage of the Series elements (including the index) should be displayed. By default, this follows the
pandas.options.display.memory_usage setting.
True always shows memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based on column dtype and number of rows, assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources. See the Frequently Asked Questions for more details.
show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than
pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
max_cols (int | None) –
- Returns:
This method prints a summary of a Series and returns None.
- Return type:
None
See also
Series.describeGenerate descriptive statistics of Series.
Series.memory_usageMemory usage of Series.
Examples
>>> int_values = [1, 2, 3, 4, 5] >>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon'] >>> s = pd.Series(text_values, index=int_values) >>> s.info() <class 'pandas.core.series.Series'> Index: 5 entries, 1 to 5 Series name: None Non-Null Count Dtype -------------- ----- 5 non-null object dtypes: object(1) memory usage: 80.0+ bytes
Prints a summary excluding information about its values:
>>> s.info(verbose=False) <class 'pandas.core.series.Series'> Index: 5 entries, 1 to 5 dtypes: object(1) memory usage: 80.0+ bytes
Pipe the output of Series.info to a buffer instead of sys.stdout, get the buffer content, and write it to a text file:
>>> import io >>> buffer = io.StringIO() >>> s.info(buf=buffer) >>> s = buffer.getvalue() >>> with open("df_info.txt", "w", ... encoding="utf-8") as f: ... f.write(s) 260
The memory_usage parameter allows deep introspection mode, especially useful for big Series and for fine-tuning memory optimization:
>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6) >>> s = pd.Series(np.random.choice(['a', 'b', 'c'], 10 ** 6)) >>> s.info() <class 'pandas.core.series.Series'> RangeIndex: 1000000 entries, 0 to 999999 Series name: None Non-Null Count Dtype -------------- ----- 1000000 non-null object dtypes: object(1) memory usage: 7.6+ MB
>>> s.info(memory_usage='deep') <class 'pandas.core.series.Series'> RangeIndex: 1000000 entries, 0 to 999999 Series name: None Non-Null Count Dtype -------------- ----- 1000000 non-null object dtypes: object(1) memory usage: 55.3 MB
- shift(periods=1, freq=None, axis=0, fill_value=None)[source]
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.
- Parameters:
periods (int) – Number of periods to shift. Can be positive or negative.
freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For Series this parameter is unused and defaults to 0.
fill_value (object, optional) –
The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc., NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 1.1.0.
- Returns:
Copy of input object, shifted.
- Return type:
See also
Index.shiftShift values of Index.
DatetimeIndex.shiftShift values of DatetimeIndex.
PeriodIndex.shiftShift values of PeriodIndex.
Examples
>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45], ... "Col2": [13, 23, 18, 33, 48], ... "Col3": [17, 27, 22, 37, 52]}, ... index=pd.date_range("2020-01-01", "2020-01-05")) >>> df Col1 Col2 Col3 2020-01-01 10 13 17 2020-01-02 20 23 27 2020-01-03 15 18 22 2020-01-04 30 33 37 2020-01-05 45 48 52
>>> df.shift(periods=3) Col1 Col2 Col3 2020-01-01 NaN NaN NaN 2020-01-02 NaN NaN NaN 2020-01-03 NaN NaN NaN 2020-01-04 10.0 13.0 17.0 2020-01-05 20.0 23.0 27.0
>>> df.shift(periods=1, axis="columns") Col1 Col2 Col3 2020-01-01 NaN 10 13 2020-01-02 NaN 20 23 2020-01-03 NaN 15 18 2020-01-04 NaN 30 33 2020-01-05 NaN 45 48
>>> df.shift(periods=3, fill_value=0) Col1 Col2 Col3 2020-01-01 0 0 0 2020-01-02 0 0 0 2020-01-03 0 0 0 2020-01-04 10 13 17 2020-01-05 20 23 27
>>> df.shift(periods=3, freq="D") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
>>> df.shift(periods=3, freq="infer") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
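For the Series case, a minimal illustrative sketch:
>>> s = pd.Series([10, 20, 30]) >>> s.shift(1) 0 NaN 1 10.0 2 20.0 dtype: float64 >>> s.shift(-1, fill_value=0) 0 20 1 30 2 0 dtype: int64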
- add(other, level=None, fill_value=None, axis=0)
Return Addition of series and other, element-wise (binary operator add).
Equivalent to
series + other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.raddReverse of the Addition operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
- all(axis=0, bool_only=None, skipna=True, **kwargs)
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a Series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, a Series is returned; otherwise, a scalar is returned.
- Return type:
scalar or Series
See also
Series.allReturn True if all elements are True.
DataFrame.anyReturn True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all() True >>> pd.Series([True, False]).all() False >>> pd.Series([], dtype="float64").all() True >>> pd.Series([np.nan]).all() True >>> pd.Series([np.nan]).all(skipna=False) True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) >>> df col1 col2 0 True True 1 True False
Default behaviour checks if values in each column all return True.
>>> df.all() col1 True col2 False dtype: bool
Specify
axis='columns' to check if values in each row all return True.
>>> df.all(axis='columns') 0 True 1 False dtype: bool
Or
axis=None for whether every value is True.
>>> df.all(axis=None) False
- cummax(axis=None, skipna=True, *args, **kwargs)
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative maximum of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.maxSimilar functionality but ignores
NaN values.
Series.maxReturn the maximum over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cummax(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cummax() A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use
axis=1
>>> df.cummax(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
- cummin(axis=None, skipna=True, *args, **kwargs)
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative minimum of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.minSimilar functionality but ignores
NaN values.
Series.minReturn the minimum over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cummin(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cummin() A B 0 2.0 1.0 1 2.0 NaN 2 1.0 0.0
To iterate over columns and find the minimum in each row, use
axis=1
>>> df.cummin(axis=1) A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
- cumprod(axis=None, skipna=True, *args, **kwargs)
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative product of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.prodSimilar functionality but ignores
NaN values.
Series.prodReturn the product over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cumprod(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cumprod() A B 0 2.0 1.0 1 6.0 NaN 2 6.0 0.0
To iterate over columns and find the product in each row, use
axis=1
>>> df.cumprod(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
- cumsum(axis=None, skipna=True, *args, **kwargs)
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative sum of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.sumSimilar functionality but ignores
NaN values.
Series.sumReturn the sum over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cumsum(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to
axis=None or axis='index'.
>>> df.cumsum() A B 0 2.0 1.0 1 5.0 NaN 2 6.0 1.0
To iterate over columns and find the sum in each row, use
axis=1
>>> df.cumsum(axis=1) A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
- divide(other, level=None, fill_value=None, axis=0)
Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to
series / other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rtruedivReverse of the Floating division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- divmod(other, level=None, fill_value=None, axis=0)
Return Integer division and modulo of series and other, element-wise (binary operator divmod).
Equivalent to
divmod(series, other), but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
2-Tuple of Series
See also
Series.rdivmodReverse of the Integer division and modulo operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divmod(b, fill_value=0) (a 1.0 b NaN c NaN d 0.0 e NaN dtype: float64, a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64)
- eq(other, level=None, fill_value=None, axis=0)
Return Equal to of series and other, element-wise (binary operator eq).
Equivalent to
series == other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.eq(b, fill_value=0) a True b False c False d False e False dtype: bool
- floordiv(other, level=None, fill_value=None, axis=0)
Return Integer division of series and other, element-wise (binary operator floordiv).
Equivalent to series // other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rfloordiv – Reverse of the Integer division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.floordiv(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- ge(other, level=None, fill_value=None, axis=0)
Return Greater than or equal to of series and other, element-wise (binary operator ge).
Equivalent to series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.ge(b, fill_value=0) a True b True c False d False e True f False dtype: bool
- gt(other, level=None, fill_value=None, axis=0)
Return Greater than of series and other, element-wise (binary operator gt).
Equivalent to series > other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.gt(b, fill_value=0) a True b False c False d False e True f False dtype: bool
- kurt(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
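Examples
A minimal sketch, added here for illustration since this entry lacks an example; the value 1.5 follows from the adjusted Fisher definition above:
>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse']) >>> s.kurt() 1.5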
- kurtosis(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
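Examples
kurtosis is an alias of kurt, so the same illustrative sketch applies:
>>> s = pd.Series([1, 2, 2, 3]) >>> s.kurtosis() 1.5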
- le(other, level=None, fill_value=None, axis=0)
Return Less than or equal to of series and other, element-wise (binary operator le).
Equivalent to series <= other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.le(b, fill_value=0) a False b True c True d False e False f True dtype: bool
- lt(other, level=None, fill_value=None, axis=0)
Return Less than of series and other, element-wise (binary operator lt).
Equivalent to series < other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.lt(b, fill_value=0) a False b False c True d False e False f True dtype: bool
- max(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the maximum of the values over the requested axis.
If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.max() 8
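As noted above, idxmax returns the label of the maximum rather than its value (an illustrative sketch; with a MultiIndex the label is a tuple):
>>> s.idxmax() ('cold', 'spider')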
- mean(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the mean of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
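Examples
A minimal sketch, added here for illustration since this entry lacks an example:
>>> s = pd.Series([1, 2, 3]) >>> s.mean() 2.0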
- median(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the median of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
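Examples
A minimal sketch, added here for illustration; with an even number of values the median is the average of the two middle values:
>>> s = pd.Series([1, 2, 3, 4]) >>> s.median() 2.5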
- memory_usage(index=True, deep=False)[source]
Return the memory usage of the Series.
The memory usage can optionally include the contribution of the index and of elements of object dtype.
- Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the Series index.
deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned value.
- Returns:
Bytes of memory consumed.
- Return type:
int
See also
numpy.ndarray.nbytes – Total bytes consumed by the elements of the array.
DataFrame.memory_usage – Bytes consumed by a DataFrame.
Examples
>>> s = pd.Series(range(3)) >>> s.memory_usage() 152
Not including the index gives the size of the rest of the data, which is necessarily smaller:
>>> s.memory_usage(index=False) 24
The memory footprint of object values is ignored by default:
>>> s = pd.Series(["a", "b"]) >>> s.values array(['a', 'b'], dtype=object) >>> s.memory_usage() 144 >>> s.memory_usage(deep=True) 244
- min(axis=0, skipna=True, numeric_only=False, **kwargs)
Return the minimum of the values over the requested axis.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.min() 0
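Likewise, idxmin returns the label of the minimum rather than its value (an illustrative sketch):
>>> s.idxmin() ('cold', 'fish')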
- mod(other, level=None, fill_value=None, axis=0)
Return Modulo of series and other, element-wise (binary operator mod).
Equivalent to series % other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rmod – Reverse of the Modulo operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.mod(b, fill_value=0) a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64
- mul(other, level=None, fill_value=None, axis=0)
Return Multiplication of series and other, element-wise (binary operator mul).
Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rmul – Reverse of the Multiplication operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
- multiply(other, level=None, fill_value=None, axis=0)
Return Multiplication of series and other, element-wise (binary operator mul).
Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rmul – Reverse of the Multiplication operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
- ne(other, level=None, fill_value=None, axis=0)
Return Not equal to of series and other, element-wise (binary operator ne).
Equivalent to series != other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.ne(b, fill_value=0) a False b True c True d True e True dtype: bool
- pow(other, level=None, fill_value=None, axis=0)
Return Exponential power of series and other, element-wise (binary operator pow).
Equivalent to series ** other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rpow – Reverse of the Exponential power operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.pow(b, fill_value=0) a 1.0 b 1.0 c 1.0 d 0.0 e NaN dtype: float64
- prod(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is 1.
>>> pd.Series([], dtype="float64").prod() 1.0
This can be controlled with the min_count parameter.
>>> pd.Series([], dtype="float64").prod(min_count=1) nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod() 1.0
>>> pd.Series([np.nan]).prod(min_count=1) nan
- product(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is 1.
>>> pd.Series([], dtype="float64").prod() 1.0
This can be controlled with the min_count parameter.
>>> pd.Series([], dtype="float64").prod(min_count=1) nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod() 1.0
>>> pd.Series([np.nan]).prod(min_count=1) nan
- radd(other, level=None, fill_value=None, axis=0)
Return Addition of series and other, element-wise (binary operator radd).
Equivalent to other + series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.add – Element-wise Addition, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
- rdivmod(other, level=None, fill_value=None, axis=0)
Return Integer division and modulo of series and other, element-wise (binary operator rdivmod).
Equivalent to divmod(other, series), but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
2-Tuple of Series
See also
Series.divmod – Element-wise Integer division and modulo, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divmod(b, fill_value=0) (a 1.0 b inf c inf d 0.0 e NaN dtype: float64, a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64)
- rfloordiv(other, level=None, fill_value=None, axis=0)
Return Integer division of series and other, element-wise (binary operator rfloordiv).
Equivalent to other // series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.floordiv – Element-wise Integer division, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.floordiv(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- rmod(other, level=None, fill_value=None, axis=0)
Return Modulo of series and other, element-wise (binary operator rmod).
Equivalent to other % series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.mod – Element-wise Modulo, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.mod(b, fill_value=0) a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64
- rmul(other, level=None, fill_value=None, axis=0)
Return Multiplication of series and other, element-wise (binary operator rmul).
Equivalent to other * series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.mul – Element-wise Multiplication, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
- rpow(other, level=None, fill_value=None, axis=0)
Return Exponential power of series and other, element-wise (binary operator rpow).
Equivalent to other ** series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.pow – Element-wise Exponential power, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.pow(b, fill_value=0) a 1.0 b 1.0 c 1.0 d 0.0 e NaN dtype: float64
- rsub(other, level=None, fill_value=None, axis=0)
Return Subtraction of series and other, element-wise (binary operator rsub).
Equivalent to other - series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.sub – Element-wise Subtraction, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
- rtruediv(other, level=None, fill_value=None, axis=0)
Return Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.truediv – Element-wise Floating division, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- sem(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0)}) – For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
scalar or Series (if level specified)
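Examples
A minimal sketch, added here for illustration: the sample standard deviation of [1, 2, 3] is 1.0, so the standard error is 1/sqrt(3) (the exact float repr may vary by platform):
>>> s = pd.Series([1, 2, 3]) >>> s.sem() 0.5773502691896258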
- skew(axis=0, skipna=True, numeric_only=False, **kwargs)
Return unbiased skew over requested axis.
Normalized by N-1.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
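Examples
A minimal sketch, added here for illustration; perfectly symmetric data has zero sample skew:
>>> s = pd.Series([1, 2, 3]) >>> s.skew() 0.0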
- std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0)}) – For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
scalar or Series (if level specified)
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
The standard deviation of the columns can be found as follows:
>>> df.std() age 18.786076 height 0.237417 dtype: float64
Alternatively, ddof=0 can be set to normalize by N instead of N-1:
>>> df.std(ddof=0) age 16.269219 height 0.205609 dtype: float64
- sub(other, level=None, fill_value=None, axis=0)
Return Subtraction of series and other, element-wise (binary operator sub).
Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rsub – Reverse of the Subtraction operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
- subtract(other, level=None, fill_value=None, axis=0)
Return Subtraction of series and other, element-wise (binary operator sub).
Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rsub – Reverse of the Subtraction operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
- sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)
Return the sum of the values over the requested axis.
This is equivalent to the method numpy.sum.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
New in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar
See also
Series.sum – Return the sum.
Series.min – Return the minimum.
Series.max – Return the maximum.
Series.idxmin – Return the index of the minimum.
Series.idxmax – Return the index of the maximum.
DataFrame.sum – Return the sum over the requested axis.
DataFrame.min – Return the minimum over the requested axis.
DataFrame.max – Return the maximum over the requested axis.
DataFrame.idxmin – Return the index of the minimum over the requested axis.
DataFrame.idxmax – Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.sum() 14
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([], dtype="float64").sum() # min_count=0 is the default 0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
>>> pd.Series([], dtype="float64").sum(min_count=1) nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum() 0.0
>>> pd.Series([np.nan]).sum(min_count=1) nan
- truediv(other, level=None, fill_value=None, axis=0)
Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.
- Parameters:
other (Series or scalar value) –
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Series
See also
Series.rtruediv – Reverse of the Floating division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- var(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)
Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0)}) – For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
scalar or Series (if level specified)
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
>>> df.var() age 352.916667 height 0.056367 dtype: float64
Alternatively,
ddof=0can be set to normalize by N instead of N-1:>>> df.var(ddof=0) age 264.687500 height 0.042275 dtype: float64
- isin(values)[source]
Whether elements in Series are contained in values.
Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
- Parameters:
values (set or list-like) – The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.
- Returns:
Series of booleans indicating if each element is in values.
- Return type:
Series
- Raises:
TypeError – If values is a string.
See also
DataFrame.isin – Equivalent method on DataFrame.
Examples
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', ... 'hippo'], name='animal') >>> s.isin(['cow', 'lama']) 0 True 1 True 2 True 3 False 4 True 5 False Name: animal, dtype: bool
To invert the boolean values, use the ~ operator:
>>> ~s.isin(['cow', 'lama']) 0 False 1 False 2 False 3 True 4 False 5 True Name: animal, dtype: bool
Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:
>>> s.isin(['lama']) 0 True 1 False 2 True 3 False 4 True 5 False Name: animal, dtype: bool
Strings and integers are distinct and are therefore not comparable:
>>> pd.Series([1]).isin(['1']) 0 False dtype: bool >>> pd.Series([1.1]).isin(['1.1']) 0 False dtype: bool
- between(left, right, inclusive='both')[source]
Return boolean Series equivalent to left <= series <= right.
This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
- Parameters:
left (scalar or list-like) – Left boundary.
right (scalar or list-like) – Right boundary.
inclusive ({"both", "neither", "left", "right"}) –
Include boundaries. Whether to set each bound as closed or open.
Changed in version 1.3.0.
- Returns:
Series representing whether each element is between left and right (inclusive).
- Return type:
Series
Notes
This function is equivalent to (left <= ser) & (ser <= right).
Examples
>>> s = pd.Series([2, 0, 4, 8, np.nan])
Boundary values are included by default:
>>> s.between(1, 4) 0 True 1 False 2 True 3 False 4 False dtype: bool
With inclusive set to "neither", boundary values are excluded:
>>> s.between(1, 4, inclusive="neither") 0 True 1 False 2 False 3 False 4 False dtype: bool
left and right can be any scalar value:
>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve']) >>> s.between('Anna', 'Daniel') 0 False 1 True 2 True 3 False dtype: bool
- isna()[source]
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns:
Mask of bool values for each element in Series that indicates whether an element is an NA value.
- Return type:
Series
See also
Series.isnull – Alias of isna.
Series.notna – Boolean inverse of isna.
Series.dropna – Omit axes labels with missing values.
isna – Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- isnull()[source]
Series.isnull is an alias for Series.isna.
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns:
Mask of bool values for each element in Series that indicates whether an element is an NA value.
- Return type:
Series
See also
Series.isnull – Alias of isna.
Series.notna – Boolean inverse of isna.
Series.dropna – Omit axes labels with missing values.
isna – Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- notna()[source]
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False.
- Returns:
Mask of bool values for each element in Series that indicates whether an element is not an NA value.
- Return type:
Series
See also
Series.notnull – Alias of notna.
Series.isna – Boolean inverse of notna.
Series.dropna – Omit axes labels with missing values.
notna – Top-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- notnull()[source]
Series.notnull is an alias for Series.notna.
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False.
- Returns:
Mask of bool values for each element in Series that indicates whether an element is not an NA value.
- Return type:
Series
See also
Series.notnull – Alias of notna.
Series.isna – Boolean inverse of notna.
Series.dropna – Omit axes labels with missing values.
notna – Top-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- dropna(*, axis: int | Literal['index', 'columns', 'rows'] = 0, inplace: Literal[False] = False, how: Literal['any', 'all'] | None = None, ignore_index: bool = False) Series[source]
- dropna(*, axis: int | Literal['index', 'columns', 'rows'] = 0, inplace: Literal[True], how: Literal['any', 'all'] | None = None, ignore_index: bool = False) None
Return a new Series with missing values removed.
See the User Guide for more on which values are considered missing, and how to work with missing data.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
inplace (bool, default False) – If True, do operation inplace and return None.
how (str, optional) – Not in use. Kept for compatibility.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 2.0.0.
- Returns:
Series with NA entries dropped from it, or None if inplace=True.
- Return type:
Series or None
See also
Series.isna – Indicate missing values.
Series.notna – Indicate existing (non-missing) values.
Series.fillna – Replace missing values.
DataFrame.dropna – Drop rows or columns which contain NA values.
Index.dropna – Drop missing indices.
Examples
>>> ser = pd.Series([1., 2., np.nan]) >>> ser 0 1.0 1 2.0 2 NaN dtype: float64
Drop NA values from a Series.
>>> ser.dropna() 0 1.0 1 2.0 dtype: float64
Empty strings are not considered NA values. None is considered an NA value.
>>> ser = pd.Series([np.NaN, 2, pd.NaT, '', None, 'I stay']) >>> ser 0 NaN 1 2 2 NaT 3 4 None 5 I stay dtype: object >>> ser.dropna() 1 2 3 5 I stay dtype: object
- asfreq(freq, method=None, how=None, normalize=False, fill_value=None)[source]
Convert time series to specified frequency.
Returns the original data conformed to a new index with the specified frequency.
If the index of this Series is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.
- Parameters:
freq (DateOffset or str) – Frequency DateOffset or string.
method ({'backfill'/'bfill', 'pad'/'ffill'}, default None) –
Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):
’pad’ / ‘ffill’: propagate last valid observation forward to next valid
’backfill’ / ‘bfill’: use NEXT valid observation to fill.
how ({'start', 'end'}, default end) – For PeriodIndex only (see PeriodIndex.asfreq).
normalize (bool, default False) – Whether to reset output index to midnight.
fill_value (scalar, optional) – Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).
- Returns:
Series object reindexed to the specified frequency.
- Return type:
Series
See also
reindex – Conform DataFrame to new index with optional filling logic.
Notes
To learn more about the frequency strings, please see this link.
Examples
Start by creating a series with 4 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=4, freq='T') >>> series = pd.Series([0.0, None, 2.0, 3.0], index=index) >>> df = pd.DataFrame({'s': series}) >>> df s 2000-01-01 00:00:00 0.0 2000-01-01 00:01:00 NaN 2000-01-01 00:02:00 2.0 2000-01-01 00:03:00 3.0
Upsample the series into 30 second bins.
>>> df.asfreq(freq='30S') s 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 NaN 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2.0 2000-01-01 00:02:30 NaN 2000-01-01 00:03:00 3.0
Upsample again, providing a fill value.
>>> df.asfreq(freq='30S', fill_value=9.0) s 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 9.0 2000-01-01 00:01:00 NaN 2000-01-01 00:01:30 9.0 2000-01-01 00:02:00 2.0 2000-01-01 00:02:30 9.0 2000-01-01 00:03:00 3.0
Upsample again, providing a method.
>>> df.asfreq(freq='30S', method='bfill') s 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 NaN 2000-01-01 00:01:30 2.0 2000-01-01 00:02:00 2.0 2000-01-01 00:02:30 3.0 2000-01-01 00:03:00 3.0
- resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, on=None, level=None, origin='start_day', offset=None, group_keys=False)[source]
Resample time-series data.
Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.
- Parameters:
rule (DateOffset, Timedelta or str) – The offset string or object representing target conversion.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
closed ({'right', 'left'}, default None) – Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label ({'right', 'left'}, default None) – Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention ({'start', 'end', 's', 'e'}, default 'start') – For PeriodIndex only, controls whether to use the start or end of rule.
kind ({'timestamp', 'period'}, optional, default None) – Pass ‘timestamp’ to convert the resulting index to a DatetimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
on (str, optional) – For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
level (str or int, optional) – For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.
origin (Timestamp or str, default 'start_day') –
The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:
‘epoch’: origin is 1970-01-01
‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries
New in version 1.1.0.
‘end’: origin is the last value of the timeseries
‘end_day’: origin is the ceiling midnight of the last day
New in version 1.3.0.
offset (Timedelta or str, default is None) –
An offset timedelta added to the origin.
New in version 1.1.0.
group_keys (bool, default False) –
Whether to include the group keys in the result index when using .apply() on the resampled object.
New in version 1.5.0: Not specifying group_keys will retain values-dependent behavior from pandas 1.4 and earlier (see pandas 1.5.0 release notes for examples).
Changed in version 2.0.0: group_keys now defaults to False.
- Returns:
Resampler object.
- Return type:
pandas.core.Resampler
See also
Series.resample: Resample a Series.
DataFrame.resample: Resample a DataFrame.
groupby: Group Series by mapping, function, label, or list of labels.
asfreq: Reindex a Series with the given frequency without grouping.
Notes
See the user guide for more.
To learn more about the offset strings, please see this link.
Examples
Start by creating a series with 9 one-minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]  # Select first 5 rows
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    1.0
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the ffill method.

>>> series.resample('30S').ffill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Pass a custom function via apply.

>>> def custom_resampler(arraylike):
...     return np.sum(arraylike) + 5
...
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01', ... freq='A', ... periods=2)) >>> s 2012 1 2013 2 Freq: A-DEC, dtype: int64 >>> s.resample('Q', convention='start').asfreq() 2012Q1 1.0 2012Q2 NaN 2012Q3 NaN 2012Q4 NaN 2013Q1 2.0 2013Q2 NaN 2013Q3 NaN 2013Q4 NaN Freq: Q-DEC, dtype: float64
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01', ... freq='Q', ... periods=4)) >>> q 2018Q1 1 2018Q2 2 2018Q3 3 2018Q4 4 Freq: Q-DEC, dtype: int64 >>> q.resample('M', convention='end').asfreq() 2018-03 1.0 2018-04 NaN 2018-05 NaN 2018-06 2.0 2018-07 NaN 2018-08 NaN 2018-09 3.0 2018-10 NaN 2018-11 NaN 2018-12 4.0 Freq: M, dtype: float64
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.
>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19], ... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]} >>> df = pd.DataFrame(d) >>> df['week_starting'] = pd.date_range('01/01/2018', ... periods=8, ... freq='W') >>> df price volume week_starting 0 10 50 2018-01-07 1 11 60 2018-01-14 2 9 40 2018-01-21 3 13 100 2018-01-28 4 14 50 2018-02-04 5 18 100 2018-02-11 6 17 40 2018-02-18 7 19 50 2018-02-25 >>> df.resample('M', on='week_starting').mean() price volume week_starting 2018-01-31 10.75 62.5 2018-02-28 17.00 60.0
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.
>>> days = pd.date_range('1/1/2000', periods=4, freq='D') >>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19], ... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]} >>> df2 = pd.DataFrame( ... d2, ... index=pd.MultiIndex.from_product( ... [days, ['morning', 'afternoon']] ... ) ... ) >>> df2 price volume 2000-01-01 morning 10 50 afternoon 11 60 2000-01-02 morning 9 40 afternoon 13 100 2000-01-03 morning 14 50 afternoon 18 100 2000-01-04 morning 17 40 afternoon 19 50 >>> df2.resample('D', level=0).sum() price volume 2000-01-01 21 110 2000-01-02 22 140 2000-01-03 32 150 2000-01-04 36 90
If you want to adjust the start of the bins based on a fixed timestamp:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00' >>> rng = pd.date_range(start, end, freq='7min') >>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng) >>> ts 2000-10-01 23:30:00 0 2000-10-01 23:37:00 3 2000-10-01 23:44:00 6 2000-10-01 23:51:00 9 2000-10-01 23:58:00 12 2000-10-02 00:05:00 15 2000-10-02 00:12:00 18 2000-10-02 00:19:00 21 2000-10-02 00:26:00 24 Freq: 7T, dtype: int64
>>> ts.resample('17min').sum() 2000-10-01 23:14:00 0 2000-10-01 23:31:00 9 2000-10-01 23:48:00 21 2000-10-02 00:05:00 54 2000-10-02 00:22:00 24 Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum() 2000-10-01 23:18:00 0 2000-10-01 23:35:00 18 2000-10-01 23:52:00 27 2000-10-02 00:09:00 39 2000-10-02 00:26:00 24 Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum() 2000-10-01 23:24:00 3 2000-10-01 23:41:00 15 2000-10-01 23:58:00 45 2000-10-02 00:15:00 45 Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
>>> ts.resample('17min', origin='start').sum() 2000-10-01 23:30:00 9 2000-10-01 23:47:00 21 2000-10-02 00:04:00 54 2000-10-02 00:21:00 24 Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum() 2000-10-01 23:30:00 9 2000-10-01 23:47:00 21 2000-10-02 00:04:00 54 2000-10-02 00:21:00 24 Freq: 17T, dtype: int64
If you want to take the largest Timestamp as the end of the bins:
>>> ts.resample('17min', origin='end').sum() 2000-10-01 23:35:00 0 2000-10-01 23:52:00 18 2000-10-02 00:09:00 27 2000-10-02 00:26:00 63 Freq: 17T, dtype: int64
In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:
>>> ts.resample('17min', origin='end_day').sum() 2000-10-01 23:38:00 3 2000-10-01 23:55:00 15 2000-10-02 00:12:00 45 2000-10-02 00:29:00 45 Freq: 17T, dtype: int64
- to_timestamp(freq=None, how='start', copy=None)[source]
Cast to DatetimeIndex of Timestamps, at beginning of period.
- Parameters:
freq (str, default frequency of PeriodIndex) – Desired frequency.
how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.
copy (bool, default True) – Whether or not to return a copy.
- Return type:
Series with DatetimeIndex
Examples
>>> idx = pd.PeriodIndex(['2023', '2024', '2025'], freq='Y') >>> s1 = pd.Series([1, 2, 3], index=idx) >>> s1 2023 1 2024 2 2025 3 Freq: A-DEC, dtype: int64
The resulting frequency of the Timestamps is YearBegin
>>> s1 = s1.to_timestamp() >>> s1 2023-01-01 1 2024-01-01 2 2025-01-01 3 Freq: AS-JAN, dtype: int64
Using freq which is the offset that the Timestamps will have
>>> s2 = pd.Series([1, 2, 3], index=idx) >>> s2 = s2.to_timestamp(freq='M') >>> s2 2023-01-31 1 2024-01-31 2 2025-01-31 3 Freq: A-JAN, dtype: int64
- to_period(freq=None, copy=None)[source]
Convert Series from DatetimeIndex to PeriodIndex.
- Parameters:
freq (str, default None) – Frequency associated with the PeriodIndex.
copy (bool, default True) – Whether or not to return a copy.
- Returns:
Series with index converted to PeriodIndex.
- Return type:
Examples
>>> idx = pd.DatetimeIndex(['2023', '2024', '2025']) >>> s = pd.Series([1, 2, 3], index=idx) >>> s = s.to_period() >>> s 2023 1 2024 2 2025 3 Freq: A-DEC, dtype: int64
Viewing the index
>>> s.index PeriodIndex(['2023', '2024', '2025'], dtype='period[A-DEC]')
- ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast: dict | None = None) Series[source]
- ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast: dict | None = None) None
- ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast: dict | None = None) Series | None
Synonym for DataFrame.fillna() with method='ffill'.
- Returns:
Object with missing values filled, or None if inplace=True.
- Return type:
Series/DataFrame or None
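A minimal sketch (not part of the original docstring) of forward-filling a Series:

>>> s = pd.Series([1.0, None, None, 4.0])
>>> s.ffill()  # each NaN takes the last valid value before it
0    1.0
1    1.0
2    1.0
3    4.0
dtype: float64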
- bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast: dict | None = None) Series[source]
- bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast: dict | None = None) None
- bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast: dict | None = None) Series | None
Synonym for DataFrame.fillna() with method='bfill'.
- Returns:
Object with missing values filled, or None if inplace=True.
- Return type:
Series/DataFrame or None
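A minimal sketch (not part of the original docstring) of backward-filling a Series:

>>> s = pd.Series([1.0, None, None, 4.0])
>>> s.bfill()  # each NaN takes the next valid value after it
0    1.0
1    4.0
2    4.0
3    4.0
dtype: float64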
- clip(lower=None, upper=None, *, axis=None, inplace=False, **kwargs)[source]
Trim values at input threshold(s).
Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
- Parameters:
lower (float or array-like, default None) – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
upper (float or array-like, default None) – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.
inplace (bool, default False) – Whether to perform the operation in place on the data.
*args – Additional positional arguments have no effect but might be accepted for compatibility with numpy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.
- Returns:
Same type as calling object with the values outside the clip boundaries replaced, or None if inplace=True.
- Return type:
See also
Series.clip: Trim values at input threshold in series.
DataFrame.clip: Trim values at input threshold in dataframe.
numpy.clip: Clip (limit) the values in an array.
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]} >>> df = pd.DataFrame(data) >>> df col_0 col_1 0 9 -2 1 -3 -7 2 0 6 3 -1 8 4 5 -5
Clips per column using lower and upper thresholds:
>>> df.clip(-4, 6) col_0 col_1 0 6 -2 1 -3 -4 2 0 6 3 -1 6 4 5 -4
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3]) >>> t 0 2 1 -4 2 -1 3 6 4 3 dtype: int64
>>> df.clip(t, t + 4, axis=0) col_0 col_1 0 6 2 1 -3 -4 2 0 3 3 6 8 4 5 3
Clips using specific lower threshold per column element, with missing values:
>>> t = pd.Series([2, -4, np.NaN, 6, 3]) >>> t 0 2.0 1 -4.0 2 NaN 3 6.0 4 3.0 dtype: float64
>>> df.clip(t, axis=0) col_0 col_1 0 9 2 1 -3 -4 2 0 6 3 6 8 4 5 3
- interpolate(method='linear', *, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)[source]
Fill NaN values using an interpolation method.
Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.
- Parameters:
method (str, default 'linear') –
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: Use the actual numerical values of the index.
‘pad’: Fill in NaNs using existing values.
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d, whereas ‘spline’ is passed to scipy.interpolate.UnivariateSpline. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5). Note that the slinear method in pandas refers to the SciPy first-order spline, not the pandas first-order spline.
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the ‘piecewise_polynomial’ interpolation method in SciPy 0.18.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Axis to interpolate along. For Series this parameter is unused and defaults to 0.
limit (int, optional) – Maximum number of consecutive NaNs to fill. Must be greater than 0.
inplace (bool, default False) – Update the data in place if possible.
limit_direction ({'forward', 'backward', 'both'}, optional) –
Consecutive NaNs will be filled in this direction.
- If limit is specified:
If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.
If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backward’.
- If ‘limit’ is not specified:
If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’;
otherwise the default is ‘forward’.
Changed in version 1.1.0: Raises ValueError if limit_direction is ‘forward’ or ‘both’ and method is ‘backfill’ or ‘bfill’; raises ValueError if limit_direction is ‘backward’ or ‘both’ and method is ‘pad’ or ‘ffill’.
limit_area ({None, ‘inside’, ‘outside’}, default None) –
If limit is specified, consecutive NaNs will be filled with this restriction.
None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values (interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).
downcast (optional, 'infer' or None, defaults to None) – Downcast dtypes if possible.
**kwargs (optional) – Keyword arguments to pass on to the interpolating function.
- Returns:
Returns the same object type as the caller, interpolated at some or all NaN values, or None if inplace=True.
- Return type:
See also
fillna: Fill missing values using different methods.
scipy.interpolate.Akima1DInterpolator: Piecewise cubic polynomials (Akima interpolator).
scipy.interpolate.BPoly.from_derivatives: Piecewise polynomial in the Bernstein basis.
scipy.interpolate.interp1d: Interpolate a 1-D function.
scipy.interpolate.KroghInterpolator: Interpolate polynomial (Krogh interpolator).
scipy.interpolate.PchipInterpolator: PCHIP 1-d monotonic cubic interpolation.
scipy.interpolate.CubicSpline: Cubic spline data interpolator.
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation.
Examples
Filling in NaN in a Series via linear interpolation.

>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.

>>> s = pd.Series([np.nan, "single_one", np.nan,
...                "fill_two_more", np.nan, np.nan, np.nan,
...                4.71, np.nan])
>>> s
0              NaN
1       single_one
2              NaN
3    fill_two_more
4              NaN
5              NaN
6              NaN
7             4.71
8              NaN
dtype: object
>>> s.interpolate(method='pad', limit=2)
0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6              NaN
7             4.71
8             4.71
dtype: object
Filling in NaN in a Series via polynomial interpolation or splines: both ‘polynomial’ and ‘spline’ methods require that you also specify an order (int).

>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it to use for interpolation.

>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
Using polynomial interpolation.
>>> df['d'].interpolate(method='polynomial', order=2) 0 1.0 1 4.0 2 9.0 3 16.0 Name: d, dtype: float64
- where(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) Series[source]
- where(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) None
- where(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) Series | None
Replace values where the condition is False.
- Parameters:
cond (bool Series/DataFrame, array-like, or callable) – Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
other (scalar, Series/DataFrame, or callable) – Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).
inplace (bool, default False) – Whether to perform the operation in place on the data.
axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.
level (int, default None) – Alignment level if needed.
- Return type:
Same type as caller, or None if inplace=True.
See also
DataFrame.mask(): Return an object of same shape as self.
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with the axis of the cond Series/DataFrame, the misaligned index positions will be filled with False.

The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype if this can be done losslessly.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64 >>> s.mask(s > 0) 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
>>> s = pd.Series(range(5)) >>> t = pd.Series([True, False]) >>> s.where(t, 99) 0 0 1 99 2 99 3 99 4 99 dtype: int64 >>> s.mask(t, 99) 0 99 1 1 2 99 3 99 4 99 dtype: int64
>>> s.where(s > 1, 10) 0 10 1 10 2 2 3 3 4 4 dtype: int64 >>> s.mask(s > 1, 10) 0 0 1 1 2 10 3 10 4 10 dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> df A B 0 0 1 1 2 3 2 4 5 3 6 7 4 8 9 >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
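The condition may also be given as a callable; a small sketch (not part of the original docstring), where the callable is evaluated on the Series itself:

>>> s = pd.Series(range(5))
>>> s.where(lambda x: x % 2 == 0, -1)  # keep evens, replace odds
0    0
1   -1
2    2
3   -1
4    4
dtype: int64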
- mask(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) Series[source]
- mask(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) None
- mask(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) Series | None
Replace values where the condition is True.
- Parameters:
cond (bool Series/DataFrame, array-like, or callable) – Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
other (scalar, Series/DataFrame, or callable) – Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).
inplace (bool, default False) – Whether to perform the operation in place on the data.
axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.
level (int, default None) – Alignment level if needed.
- Return type:
Same type as caller, or None if inplace=True.
See also
DataFrame.where(): Return an object of same shape as self.
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with the axis of the cond Series/DataFrame, the misaligned index positions will be filled with True.

The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype if this can be done losslessly.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64 >>> s.mask(s > 0) 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
>>> s = pd.Series(range(5)) >>> t = pd.Series([True, False]) >>> s.where(t, 99) 0 0 1 99 2 99 3 99 4 99 dtype: int64 >>> s.mask(t, 99) 0 99 1 1 2 99 3 99 4 99 dtype: int64
>>> s.where(s > 1, 10) 0 10 1 10 2 2 3 3 4 4 dtype: int64 >>> s.mask(s > 1, 10) 0 0 1 1 2 10 3 10 4 10 dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> df A B 0 0 1 1 2 3 2 4 5 3 6 7 4 8 9 >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
- index
The index (axis labels) of the Series.
- str
alias of
StringMethods
- dt
alias of
CombinedDatetimelikeProperties
- cat
alias of
CategoricalAccessor
- plot
alias of
PlotAccessor
- sparse
alias of
SparseAccessor
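As a quick illustration of the accessors above (a sketch, not part of the original docstring), string methods are reached through .str:

>>> s = pd.Series(['a', 'b'])
>>> s.str.upper()  # vectorized string method via the str accessor
0    A
1    B
dtype: object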
- hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, backend=None, legend=False, **kwargs)
Draw histogram of the input series using matplotlib.
- Parameters:
by (object, optional) – If passed, then used to form histograms for separate groups.
ax (matplotlib axis object) – If not passed, uses gca().
grid (bool, default True) – Whether to show axis grid lines.
xlabelsize (int, default None) – If specified changes the x-axis label size.
xrot (float, default None) – Rotation of x axis labels.
ylabelsize (int, default None) – If specified changes the y-axis label size.
yrot (float, default None) – Rotation of y axis labels.
figsize (tuple, default None) – Figure size in inches by default.
bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.
backend (str, default None) – Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
legend (bool, default False) – Whether to show the legend.
New in version 1.1.0.
**kwargs – To be passed to the actual plotting function.
- Returns:
A histogram plot.
- Return type:
matplotlib.AxesSubplot
See also
matplotlib.axes.Axes.hist: Plot a histogram using matplotlib.
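A minimal plotting sketch (not part of the original docstring; assumes matplotlib is installed):

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> s = pd.Series(np.random.default_rng(0).standard_normal(1000))
>>> ax = s.hist(bins=20)  # draw the histogram on a matplotlib Axes
>>> plt.show()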
- class pandas.SparseDtype[source]
Dtype for data stored in SparseArray.
This dtype implements the pandas ExtensionDtype interface.
- Parameters:
dtype (str, ExtensionDtype, numpy.dtype, type, default numpy.float64) – The dtype of the underlying array storing the non-fill values.
fill_value (scalar, optional) –
The scalar value not stored in the SparseArray. By default, this depends on dtype.
dtype          na_value
float          np.nan
int            0
bool           False
datetime64     pd.NaT
timedelta64    pd.NaT
The default value may be overridden by specifying a fill_value.
- None
- None()
- property fill_value
The fill value of the array.
Converting the SparseArray to a dense ndarray will fill the array with this value.
Warning
It’s possible to end up with a SparseArray that has fill_value values in sp_values. This can occur, for example, when setting SparseArray.fill_value directly.
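A small sketch (not part of the original docstring) of how the default fill_value follows the dtype; the reprs below are what current pandas is expected to print:

>>> pd.SparseDtype(int)            # default fill value for int is 0
Sparse[int64, 0]
>>> pd.SparseDtype(float)          # default fill value for float is NaN
Sparse[float64, nan]
>>> pd.SparseDtype(float, fill_value=0.0)  # override the default
Sparse[float64, 0.0]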
- property type
The scalar type for the array, e.g. int.
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- property subtype
- property name: str
A string identifying the data type.
Will be used for display in, e.g., Series.dtype.
- classmethod construct_array_type()[source]
Return the array type associated with this dtype.
- Return type:
- classmethod construct_from_string(string)[source]
Construct a SparseDtype from a string form.
- Parameters:
string (str) –
Can take the following forms.
string             dtype
‘int’              SparseDtype[np.int64, 0]
‘Sparse’           SparseDtype[np.float64, nan]
‘Sparse[int]’      SparseDtype[np.int64, 0]
‘Sparse[int, 0]’   SparseDtype[np.int64, 0]
It is not possible to specify non-default fill values with a string. An argument like 'Sparse[int, 1]' will raise a TypeError because the default fill value for integers is 0.
- Return type:
- classmethod is_dtype(dtype)[source]
Check if we match ‘dtype’.
Notes
The default implementation is True if any of the following hold:
- cls.construct_from_string(dtype) is an instance of cls.
- dtype is an object and is an instance of cls.
- dtype has a dtype attribute, and any of the above conditions is true for dtype.dtype.
- update_dtype(dtype)[source]
Convert the SparseDtype to a new dtype.
This takes care of converting the fill_value.
- Parameters:
dtype (Union[str, numpy.dtype, SparseDtype]) –
The new dtype to use.
For a SparseDtype, it is simply returned
For a NumPy dtype (or str), the current fill value is converted to the new dtype, and a SparseDtype with dtype and the new fill value is returned.
- Returns:
A new SparseDtype with the correct dtype and fill value for that dtype.
- Return type:
- Raises:
ValueError – When the current fill value cannot be converted to the new dtype (e.g. trying to convert np.nan to an integer dtype).
Examples
>>> SparseDtype(int, 0).update_dtype(float) Sparse[float64, 0.0]
>>> SparseDtype(int, 1).update_dtype(SparseDtype(float, np.nan)) Sparse[float64, nan]
- class pandas.StringDtype[source]
Extension dtype for string data.
Warning
StringDtype is considered experimental. The implementation and parts of the API may change without warning.
- Parameters:
storage ({"python", "pyarrow"}, optional) – If not given, the value of
pd.options.mode.string_storage.
- None
- None()
Examples
>>> pd.StringDtype() string[python]
>>> pd.StringDtype(storage="pyarrow") string[pyarrow]
- property na_value: NAType
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- property type: type[str]
The scalar type for the array, e.g. int.
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- classmethod construct_from_string(string)[source]
Construct a StringDtype from a string.
- Parameters:
string (str) –
The type of the name. The storage type will be taken from string. Valid options and their storage types are:
string               result storage
‘string’             pd.options.mode.string_storage, default python
‘string[python]’     python
‘string[pyarrow]’    pyarrow
- Return type:
- Raises:
TypeError – If the string is not a valid option.
- class pandas.Timedelta
Represents a duration, the difference between two dates or times.
Timedelta is the pandas equivalent of python’s datetime.timedelta and is interchangeable with it in most cases.
- Parameters:
unit (str, default 'ns') –
Denote the unit of the input, if input is an integer.
Possible values:
‘W’, ‘D’, ‘T’, ‘S’, ‘L’, ‘U’, or ‘N’
‘days’ or ‘day’
‘hours’, ‘hour’, ‘hr’, or ‘h’
‘minutes’, ‘minute’, ‘min’, or ‘m’
‘seconds’, ‘second’, or ‘sec’
‘milliseconds’, ‘millisecond’, ‘millis’, or ‘milli’
‘microseconds’, ‘microsecond’, ‘micros’, or ‘micro’
‘nanoseconds’, ‘nanosecond’, ‘nanos’, ‘nano’, or ‘ns’.
**kwargs – Available kwargs: {days, seconds, microseconds, milliseconds, minutes, hours, weeks}. Values for construction in compat with datetime.timedelta. Numpy ints and floats will be coerced to python ints and floats.
Notes
The constructor may take either both value and unit, or kwargs as above; one of these forms must be used during initialization.
The .value attribute is always in ns.
If the precision is higher than nanoseconds, the precision of the duration is truncated to nanoseconds.
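As an illustrative check (not part of the original docstring), the ns-based value of a one-day Timedelta:

>>> pd.Timedelta(1, "d").value  # 86400 seconds expressed in nanoseconds
86400000000000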
Examples
Here we initialize Timedelta object with both value and unit
>>> td = pd.Timedelta(1, "d") >>> td Timedelta('1 days 00:00:00')
Here we initialize the Timedelta object with kwargs
>>> td2 = pd.Timedelta(days=1) >>> td2 Timedelta('1 days 00:00:00')
We see that either way we get the same result:
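>>> td == td2
True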
- ceil(freq)
Return a new Timedelta ceiled to this resolution.
- Parameters:
freq (str) – Frequency string indicating the ceiling resolution.
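A small sketch (not part of the original docstring), ceiling a Timedelta to hourly resolution:

>>> td = pd.Timedelta('1 days 02:34:56')
>>> td.ceil('H')  # round up to the next whole hour
Timedelta('1 days 03:00:00')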
- class pandas.TimedeltaIndex[source]
Immutable Index of timedelta64 data.
Represented internally as int64; scalars are returned as Timedelta objects.
- Parameters:
data (array-like (1-dimensional), optional) – Optional timedelta-like data to construct index with.
unit (str, optional) – Unit of the data (D, h, m, s, ms, us, ns), used when the input is an integer or float.
freq (str or pandas offset object, optional) – One of pandas date offset strings or corresponding objects. The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon creation.
copy (bool) – Make a copy of input ndarray.
name (object) – Name to be stored in the index.
- days
- seconds
- microseconds
- nanoseconds
- components
- inferred_freq
- to_pytimedelta()
- to_series()
- round()
- floor()
- ceil()
- to_frame()
- mean()
See also
Index: The base pandas Index type.
Timedelta: Represents a duration between two dates or times.
DatetimeIndex: Index of datetime64 data.
PeriodIndex: Index of Period data.
timedelta_range: Create a fixed-frequency TimedeltaIndex.
Notes
To learn more about the frequency strings, please see this link.
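A minimal construction sketch (not part of the original docstring):

>>> pd.TimedeltaIndex(['1 days', '2 days'])
TimedeltaIndex(['1 days', '2 days'], dtype='timedelta64[ns]', freq=None)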
- ceil(*args, **kwargs)
Perform ceil operation on the data to the specified freq.
- Parameters:
freq (str or Offset) – The frequency level to ceil the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
Only relevant for DatetimeIndex:
’infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
’NaT’ will return NaT where there are ambiguous times
’raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
’shift_forward’ will shift the nonexistent time forward to the closest existing time
’shift_backward’ will shift the nonexistent time backward to the closest existing time
’NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
- Return type:
- Raises:
ValueError if the freq cannot be converted. –
Notes
If the timestamps have a timezone, ceiling will take place relative to the local (“wall”) time and re-localized to the same timezone. When ceiling near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
DatetimeIndex
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min') >>> rng DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00', '2018-01-01 12:01:00'], dtype='datetime64[ns]', freq='T') >>> rng.ceil('H') DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00', '2018-01-01 13:00:00'], dtype='datetime64[ns]', freq=None)
Series
>>> pd.Series(rng).dt.ceil("H") 0 2018-01-01 12:00:00 1 2018-01-01 12:00:00 2 2018-01-01 13:00:00 dtype: datetime64[ns]
When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 01:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.ceil("H", ambiguous=False) DatetimeIndex(['2021-10-31 02:00:00+01:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.ceil("H", ambiguous=True) DatetimeIndex(['2021-10-31 02:00:00+02:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
- property components
Return a DataFrame of the individual resolution components of the Timedeltas.
The components (days, hours, minutes seconds, milliseconds, microseconds, nanoseconds) are returned as columns in a DataFrame.
- Return type:
- property days
Number of days for each element.
- floor(*args, **kwargs)
Perform floor operation on the data to the specified freq.
- Parameters:
freq (str or Offset) – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
Only relevant for DatetimeIndex:
’infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
’NaT’ will return NaT where there are ambiguous times
’raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
’shift_forward’ will shift the nonexistent time forward to the closest existing time
’shift_backward’ will shift the nonexistent time backward to the closest existing time
’NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
- Return type:
- Raises:
ValueError if the freq cannot be converted. –
Notes
If the timestamps have a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
DatetimeIndex
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min') >>> rng DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00', '2018-01-01 12:01:00'], dtype='datetime64[ns]', freq='T') >>> rng.floor('H') DatetimeIndex(['2018-01-01 11:00:00', '2018-01-01 12:00:00', '2018-01-01 12:00:00'], dtype='datetime64[ns]', freq=None)
Series
>>> pd.Series(rng).dt.floor("H") 0 2018-01-01 11:00:00 1 2018-01-01 12:00:00 2 2018-01-01 12:00:00 dtype: datetime64[ns]
When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False) DatetimeIndex(['2021-10-31 02:00:00+01:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True) DatetimeIndex(['2021-10-31 02:00:00+02:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
- median(*args, **kwargs)
- property microseconds
Number of microseconds (>= 0 and less than 1 second) for each element.
- property nanoseconds
Number of nanoseconds (>= 0 and less than 1 microsecond) for each element.
- round(*args, **kwargs)
Perform round operation on the data to the specified freq.
- Parameters:
freq (str or Offset) – The frequency level to round the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
Only relevant for DatetimeIndex:
’infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
’NaT’ will return NaT where there are ambiguous times
’raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
’shift_forward’ will shift the nonexistent time forward to the closest existing time
’shift_backward’ will shift the nonexistent time backward to the closest existing time
’NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.
- Return type:
- Raises:
ValueError if the freq cannot be converted. –
Notes
If the timestamps have a timezone, rounding will take place relative to the local (“wall”) time and re-localized to the same timezone. When rounding near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
DatetimeIndex
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min') >>> rng DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00', '2018-01-01 12:01:00'], dtype='datetime64[ns]', freq='T') >>> rng.round('H') DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00', '2018-01-01 12:00:00'], dtype='datetime64[ns]', freq=None)
Series
>>> pd.Series(rng).dt.round("H") 0 2018-01-01 12:00:00 1 2018-01-01 12:00:00 2 2018-01-01 12:00:00 dtype: datetime64[ns]
When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False) DatetimeIndex(['2021-10-31 02:00:00+01:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True) DatetimeIndex(['2021-10-31 02:00:00+02:00'], dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
- property seconds
Number of seconds (>= 0 and less than 1 day) for each element.
- std(*args, **kwargs)
- sum(*args, **kwargs)
- to_pytimedelta(*args, **kwargs)
Return an ndarray of datetime.timedelta objects.
- Return type:
numpy.ndarray
- total_seconds(*args, **kwargs)
Return total duration of each element expressed in seconds.
This method is available directly on TimedeltaArray, TimedeltaIndex and on Series containing timedelta values under the .dt namespace.
- Returns:
When the calling object is a TimedeltaArray, the return type is ndarray. When the calling object is a TimedeltaIndex, the return type is an Index with a float64 dtype. When the calling object is a Series, the return type is Series of type float64 whose index is the same as the original.
- Return type:
See also
datetime.timedelta.total_seconds: Standard library version of this method.
TimedeltaIndex.components: Return a DataFrame with components of each Timedelta.
Examples
Series
>>> s = pd.Series(pd.to_timedelta(np.arange(5), unit='d')) >>> s 0 0 days 1 1 days 2 2 days 3 3 days 4 4 days dtype: timedelta64[ns]
>>> s.dt.total_seconds() 0 0.0 1 86400.0 2 172800.0 3 259200.0 4 345600.0 dtype: float64
TimedeltaIndex
>>> idx = pd.to_timedelta(np.arange(5), unit='d') >>> idx TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
>>> idx.total_seconds() Index([0.0, 86400.0, 172800.0, 259200.0, 345600.0], dtype='float64')
- class pandas.Timestamp
Pandas replacement for python datetime.datetime object.
Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases. It’s the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data structures in pandas.
- Parameters:
ts_input (datetime-like, str, int, float) – Value to be converted to Timestamp.
year (int) –
month (int) –
day (int) –
hour (int, optional, default 0) –
minute (int, optional, default 0) –
second (int, optional, default 0) –
microsecond (int, optional, default 0) –
tzinfo (datetime.tzinfo, optional, default None) –
nanosecond (int, optional, default 0) –
tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will have.
unit (str) –
Unit used for conversion if ts_input is of type int or float. The valid values are ‘D’, ‘h’, ‘m’, ‘s’, ‘ms’, ‘us’, and ‘ns’. For example, ‘s’ means seconds and ‘ms’ means milliseconds.
For float inputs, the result will be stored in nanoseconds, and the unit attribute will be set as 'ns'.
fold ({0, 1}, default None, keyword-only) –
Due to daylight saving time, one wall clock time can occur twice when shifting from summer to winter time; fold describes whether the datetime-like corresponds to the first (0) or the second time (1) the wall clock hits the ambiguous time.
New in version 1.1.0.
Notes
There are essentially three calling conventions for the constructor. The primary form accepts four parameters. They can be passed by position or keyword.
The other two forms mimic the parameters from datetime.datetime. They can be passed by either position or keyword, but not both mixed together.
Examples
Using the primary calling convention:
This converts a datetime-like string
>>> pd.Timestamp('2017-01-01T12') Timestamp('2017-01-01 12:00:00')
This converts a float representing a Unix epoch in units of seconds
>>> pd.Timestamp(1513393355.5, unit='s') Timestamp('2017-12-16 03:02:35.500000')
This converts an int representing a Unix-epoch in units of seconds and for a particular timezone
>>> pd.Timestamp(1513393355, unit='s', tz='US/Pacific') Timestamp('2017-12-15 19:02:35-0800', tz='US/Pacific')
Using the other two forms that mimic the API for datetime.datetime:

>>> pd.Timestamp(2017, 1, 1, 12)
Timestamp('2017-01-01 12:00:00')
>>> pd.Timestamp(year=2017, month=1, day=1, hour=12) Timestamp('2017-01-01 12:00:00')
- astimezone(tz)
Convert timezone-aware Timestamp to another time zone.
- Parameters:
tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will be converted to. None will remove timezone holding UTC time.
- Returns:
converted
- Return type:
- Raises:
TypeError – If Timestamp is tz-naive.
Examples
Create a timestamp object with UTC timezone:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC') >>> ts Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')
Change to Tokyo timezone:
>>> ts.tz_convert(tz='Asia/Tokyo') Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')
Can also use astimezone:

>>> ts.astimezone(tz='Asia/Tokyo')
Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')

Analogous for pd.NaT:

>>> pd.NaT.tz_convert(tz='Asia/Tokyo')
NaT
- ceil(freq, ambiguous='raise', nonexistent='raise')
Return a new Timestamp ceiled to this resolution.
- Parameters:
freq (str) – Frequency string indicating the ceiling resolution.
ambiguous (bool or {'raise', 'NaT'}, default 'raise') –
The behavior is as follows:
bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).
’NaT’ will return NaT for an ambiguous time.
’raise’ will raise an AmbiguousTimeError for an ambiguous time.
nonexistent ({'raise', 'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
’shift_forward’ will shift the nonexistent time forward to the closest existing time.
’shift_backward’ will shift the nonexistent time backward to the closest existing time.
’NaT’ will return NaT where there are nonexistent times.
timedelta objects will shift nonexistent times by the timedelta.
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Raises:
ValueError if the freq cannot be converted. –
Notes
If the Timestamp has a timezone, ceiling will take place relative to the local (“wall”) time and re-localized to the same timezone. When ceiling near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
Create a timestamp object:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
A timestamp can be ceiled using multiple frequency units:
>>> ts.ceil(freq='H') # hour Timestamp('2020-03-14 16:00:00')
>>> ts.ceil(freq='T') # minute Timestamp('2020-03-14 15:33:00')
>>> ts.ceil(freq='S') # seconds Timestamp('2020-03-14 15:32:53')
>>> ts.ceil(freq='U') # microseconds Timestamp('2020-03-14 15:32:52.192549')
freq can also be a multiple of a single unit, like ‘5T’ (i.e. 5 minutes):

>>> ts.ceil(freq='5T')
Timestamp('2020-03-14 15:35:00')
or a combination of multiple units, like ‘1H30T’ (i.e. 1 hour and 30 minutes):
>>> ts.ceil(freq='1H30T') Timestamp('2020-03-14 16:30:00')
Analogous for pd.NaT:

>>> pd.NaT.ceil()
NaT

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> ts_tz = pd.Timestamp("2021-10-31 01:30:00").tz_localize("Europe/Amsterdam")
>>> ts_tz.ceil("H", ambiguous=False) Timestamp('2021-10-31 02:00:00+0100', tz='Europe/Amsterdam')
>>> ts_tz.ceil("H", ambiguous=True) Timestamp('2021-10-31 02:00:00+0200', tz='Europe/Amsterdam')
- classmethod combine(date, time)
Combine date, time into datetime with same date and time fields.
Examples
>>> from datetime import date, time >>> pd.Timestamp.combine(date(2020, 3, 14), time(15, 30, 15)) Timestamp('2020-03-14 15:30:15')
- daysinmonth
Return the number of days in the month.
- Return type:
Examples
>>> ts = pd.Timestamp(2020, 3, 14) >>> ts.days_in_month 31
- floor(freq, ambiguous='raise', nonexistent='raise')
Return a new Timestamp floored to this resolution.
- Parameters:
freq (str) – Frequency string indicating the flooring resolution.
ambiguous (bool or {'raise', 'NaT'}, default 'raise') –
The behavior is as follows:
bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).
’NaT’ will return NaT for an ambiguous time.
’raise’ will raise an AmbiguousTimeError for an ambiguous time.
nonexistent ({'raise', 'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
’shift_forward’ will shift the nonexistent time forward to the closest existing time.
’shift_backward’ will shift the nonexistent time backward to the closest existing time.
’NaT’ will return NaT where there are nonexistent times.
timedelta objects will shift nonexistent times by the timedelta.
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Raises:
ValueError if the freq cannot be converted. –
Notes
If the Timestamp has a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
Create a timestamp object:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
A timestamp can be floored using multiple frequency units:
>>> ts.floor(freq='H') # hour Timestamp('2020-03-14 15:00:00')
>>> ts.floor(freq='T') # minute Timestamp('2020-03-14 15:32:00')
>>> ts.floor(freq='S') # seconds Timestamp('2020-03-14 15:32:52')
>>> ts.floor(freq='N') # nanoseconds Timestamp('2020-03-14 15:32:52.192548651')
freq can also be a multiple of a single unit, like ‘5T’ (i.e. 5 minutes):

>>> ts.floor(freq='5T')
Timestamp('2020-03-14 15:30:00')
or a combination of multiple units, like ‘1H30T’ (i.e. 1 hour and 30 minutes):
>>> ts.floor(freq='1H30T') Timestamp('2020-03-14 15:00:00')
Analogous for pd.NaT:

>>> pd.NaT.floor()
NaT

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> ts_tz = pd.Timestamp("2021-10-31 03:30:00").tz_localize("Europe/Amsterdam")
>>> ts_tz.floor("2H", ambiguous=False) Timestamp('2021-10-31 02:00:00+0100', tz='Europe/Amsterdam')
>>> ts_tz.floor("2H", ambiguous=True) Timestamp('2021-10-31 02:00:00+0200', tz='Europe/Amsterdam')
- classmethod fromordinal(ordinal, tz=None)
Construct a timestamp from a proleptic Gregorian ordinal.
- Parameters:
ordinal (int) – Date corresponding to a proleptic Gregorian ordinal.
tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for the Timestamp.
Notes
By definition there cannot be any tz info on the ordinal itself.
Examples
>>> pd.Timestamp.fromordinal(737425) Timestamp('2020-01-01 00:00:00')
- classmethod fromtimestamp(ts)
Transform timestamp[, tz] to tz’s local time from POSIX timestamp.
Examples
>>> pd.Timestamp.fromtimestamp(1584199972) Timestamp('2020-03-14 15:32:52')
Note that the output may change depending on your local time.
- isoweekday()
Return the day of the week represented by the date.
Monday == 1 … Sunday == 7.
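A small sketch (not part of the original docstring); 2023-01-01 fell on a Sunday:

>>> pd.Timestamp('2023-01-01').isoweekday()
7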
- classmethod now(tz=None)
Return new Timestamp object representing current time local to tz.
- Parameters:
tz (str or timezone object, default None) – Timezone to localize to.
Examples
>>> pd.Timestamp.now() Timestamp('2020-11-16 22:06:16.378782')
Analogous for pd.NaT:

>>> pd.NaT.now()
NaT
- replace(year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=<class 'object'>, fold=None)
Implements datetime.replace, handles nanoseconds.
- Parameters:
- Return type:
Timestamp with fields replaced
Examples
Create a timestamp object:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC') >>> ts Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')
Replace year and the hour:
>>> ts.replace(year=1999, hour=10) Timestamp('1999-03-14 10:32:52.192548651+0000', tz='UTC')
Replace timezone (not a conversion):
>>> import pytz >>> ts.replace(tzinfo=pytz.timezone('US/Pacific')) Timestamp('2020-03-14 15:32:52.192548651-0700', tz='US/Pacific')
Analogous for pd.NaT:

>>> pd.NaT.replace(tzinfo=pytz.timezone('US/Pacific'))
NaT
- round(freq, ambiguous='raise', nonexistent='raise')
Round the Timestamp to the specified resolution.
- Parameters:
freq (str) – Frequency string indicating the rounding resolution.
ambiguous (bool or {'raise', 'NaT'}, default 'raise') –
The behavior is as follows:
bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).
’NaT’ will return NaT for an ambiguous time.
’raise’ will raise an AmbiguousTimeError for an ambiguous time.
nonexistent ({'raise', 'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
’shift_forward’ will shift the nonexistent time forward to the closest existing time.
’shift_backward’ will shift the nonexistent time backward to the closest existing time.
’NaT’ will return NaT where there are nonexistent times.
timedelta objects will shift nonexistent times by the timedelta.
‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Return type:
a new Timestamp rounded to the given resolution of freq
- Raises:
ValueError if the freq cannot be converted –
Notes
If the Timestamp has a timezone, rounding will take place relative to the local (“wall”) time and re-localized to the same timezone. When rounding near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.
Examples
Create a timestamp object:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
A timestamp can be rounded using multiple frequency units:
>>> ts.round(freq='H') # hour Timestamp('2020-03-14 16:00:00')
>>> ts.round(freq='T') # minute Timestamp('2020-03-14 15:33:00')
>>> ts.round(freq='S') # seconds Timestamp('2020-03-14 15:32:52')
>>> ts.round(freq='L') # milliseconds Timestamp('2020-03-14 15:32:52.193000')
freq can also be a multiple of a single unit, like ‘5T’ (i.e. 5 minutes):

>>> ts.round(freq='5T')
Timestamp('2020-03-14 15:35:00')
or a combination of multiple units, like ‘1H30T’ (i.e. 1 hour and 30 minutes):
>>> ts.round(freq='1H30T') Timestamp('2020-03-14 15:00:00')
Analogous for pd.NaT:

>>> pd.NaT.round()
NaT

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> ts_tz = pd.Timestamp("2021-10-31 01:30:00").tz_localize("Europe/Amsterdam")
>>> ts_tz.round("H", ambiguous=False) Timestamp('2021-10-31 02:00:00+0100', tz='Europe/Amsterdam')
>>> ts_tz.round("H", ambiguous=True) Timestamp('2021-10-31 02:00:00+0200', tz='Europe/Amsterdam')
- strftime(format)
Return a formatted string of the Timestamp.
- Parameters:
format (str) – Format string to convert Timestamp to string. See strftime documentation for more information on the format string: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
Examples
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651') >>> ts.strftime('%Y-%m-%d %X') '2020-03-14 15:32:52'
- classmethod strptime(string, format)
Function is not implemented. Use pd.to_datetime().
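An equivalent parse can be done with pd.to_datetime and a format string, for example:
>>> pd.to_datetime('2020/03/14', format='%Y/%m/%d') Timestamp('2020-03-14 00:00:00')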
- to_julian_date()
Convert Timestamp to a Julian Date.
Julian day 0 is noon on January 1, 4713 BC.
Examples
>>> ts = pd.Timestamp('2020-03-14T15:32:52') >>> ts.to_julian_date() 2458923.147824074
- Return type:
float64
- classmethod today(tz=None)
Return the current time in the local timezone.
This differs from datetime.today() in that it can be localized to a passed timezone.
- Parameters:
tz (str or timezone object, default None) – Timezone to localize to.
Examples
>>> pd.Timestamp.today() Timestamp('2020-11-16 22:37:39.969883')
Analogous for pd.NaT:
>>> pd.NaT.today() NaT
- property tz
Alias for tzinfo.
Examples
>>> ts = pd.Timestamp(1584226800, unit='s', tz='Europe/Stockholm') >>> ts.tz <DstTzInfo 'Europe/Stockholm' CET+1:00:00 STD>
- tz_convert(tz)
Convert timezone-aware Timestamp to another time zone.
- Parameters:
tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will be converted to. None will remove timezone holding UTC time.
- Returns:
converted
- Return type:
Timestamp
- Raises:
TypeError – If Timestamp is tz-naive.
Examples
Create a timestamp object with UTC timezone:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC') >>> ts Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')
Change to Tokyo timezone:
>>> ts.tz_convert(tz='Asia/Tokyo') Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')
Can also use astimezone:
>>> ts.astimezone(tz='Asia/Tokyo') Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')
Analogous for pd.NaT:
>>> pd.NaT.tz_convert(tz='Asia/Tokyo') NaT
- tz_localize(tz, ambiguous='raise', nonexistent='raise')
Localize the Timestamp to a timezone.
Convert naive Timestamp to local time zone or remove timezone from timezone-aware Timestamp.
- Parameters:
tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will be converted to. None will remove timezone holding local time.
ambiguous (bool, 'NaT', default 'raise') –
When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
The behavior is as follows:
bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).
’NaT’ will return NaT for an ambiguous time.
’raise’ will raise an AmbiguousTimeError for an ambiguous time.
nonexistent ({'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.
The behavior is as follows:
’shift_forward’ will shift the nonexistent time forward to the closest existing time.
’shift_backward’ will shift the nonexistent time backward to the closest existing time.
’NaT’ will return NaT where there are nonexistent times.
timedelta objects will shift nonexistent times by the timedelta.
’raise’ will raise a NonExistentTimeError if there are nonexistent times.
- Returns:
localized
- Return type:
Timestamp
- Raises:
TypeError – If the Timestamp is tz-aware and tz is not None.
Examples
Create a naive timestamp object:
>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651') >>> ts Timestamp('2020-03-14 15:32:52.192548651')
Add ‘Europe/Stockholm’ as timezone:
>>> ts.tz_localize(tz='Europe/Stockholm') Timestamp('2020-03-14 15:32:52.192548651+0100', tz='Europe/Stockholm')
Analogous for pd.NaT:
>>> pd.NaT.tz_localize() NaT
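For ambiguous wall times around a fall-back transition, the ambiguous argument selects the offset. A minimal sketch, assuming the Europe/Amsterdam fall-back on 2018-10-28 (02:00–03:00 occurs twice); the outputs are the expected offsets:
>>> ambiguous_ts = pd.Timestamp('2018-10-28 02:30:00') >>> ambiguous_ts.tz_localize('Europe/Amsterdam', ambiguous=True) Timestamp('2018-10-28 02:30:00+0200', tz='Europe/Amsterdam') >>> ambiguous_ts.tz_localize('Europe/Amsterdam', ambiguous=False) Timestamp('2018-10-28 02:30:00+0100', tz='Europe/Amsterdam')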
- classmethod utcfromtimestamp(ts)
Construct a timezone-aware UTC datetime from a POSIX timestamp.
Notes
Timestamp.utcfromtimestamp behavior differs from datetime.utcfromtimestamp in returning a timezone-aware object.
Examples
>>> pd.Timestamp.utcfromtimestamp(1584199972) Timestamp('2020-03-14 15:32:52+0000', tz='UTC')
- classmethod utcnow()
Return a new Timestamp representing UTC day and time.
Examples
>>> pd.Timestamp.utcnow() Timestamp('2020-11-16 22:50:18.092888+0000', tz='UTC')
- weekday()
Return the day of the week represented by the date.
Monday == 0 … Sunday == 6.
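Examples
2023-01-01 fell on a Sunday:
>>> ts = pd.Timestamp('2023-01-01') >>> ts.weekday() 6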
- class pandas.UInt16Dtype[source]
An ExtensionDtype for uint16 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- type
alias of uint16
- class pandas.UInt32Dtype[source]
An ExtensionDtype for uint32 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- type
alias of uint32
- class pandas.UInt64Dtype[source]
An ExtensionDtype for uint64 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- type
alias of uint64
- class pandas.UInt8Dtype[source]
An ExtensionDtype for uint8 integer data.
Uses pandas.NA as its missing value, rather than numpy.nan.
- type
alias of uint8
- pandas.array(data, dtype=None, copy=True)[source]
Create an array.
- Parameters:
data (Sequence of objects) –
The scalars inside data should be instances of the scalar type for dtype. It’s expected that data represents a 1-dimensional array of data.
When data is an Index or Series, the underlying array will be extracted from data.
dtype (str, np.dtype, or ExtensionDtype, optional) –
The dtype to use for the array. This may be a NumPy dtype or an extension type registered with pandas using pandas.api.extensions.register_extension_dtype().
If not specified, there are two possibilities:
When data is a Series, Index, or ExtensionArray, the dtype will be taken from the data.
Otherwise, pandas will attempt to infer the dtype from the data.
Note that when data is a NumPy array, data.dtype is not used for inferring the array type. This is because NumPy cannot represent all the types of data that can be held in extension arrays.
Currently, pandas will infer an extension dtype for sequences of (scalar type – array type):
pandas.Interval – pandas.arrays.IntervalArray
pandas.Period – pandas.arrays.PeriodArray
datetime.datetime – pandas.arrays.DatetimeArray
datetime.timedelta – pandas.arrays.TimedeltaArray
int – pandas.arrays.IntegerArray
float – pandas.arrays.FloatingArray
str – pandas.arrays.StringArray or pandas.arrays.ArrowStringArray
bool – pandas.arrays.BooleanArray
The ExtensionArray created when the scalar type is str is determined by pd.options.mode.string_storage if the dtype is not explicitly given.
For all other cases, NumPy's usual inference rules will be used.
Changed in version 1.2.0: Pandas now also infers nullable-floating dtype for float-like input data
copy (bool, default True) – Whether to copy the data, even if not necessary. Depending on the type of data, creating the new array may require copying data, even if copy=False.
- Returns:
The newly created array.
- Return type:
ExtensionArray
- Raises:
ValueError – When data is not 1-dimensional.
See also
numpy.array – Construct a NumPy array.
Series – Construct a pandas Series.
Index – Construct a pandas Index.
arrays.PandasArray – ExtensionArray wrapping a NumPy array.
Series.array – Extract the array stored within a Series.
Notes
Omitting the dtype argument means pandas will attempt to infer the best array type from the values in the data. As new array types are added by pandas and 3rd party libraries, the “best” array type may change. We recommend specifying dtype to ensure that
the correct array type for the data is returned
the returned array type doesn’t change as new extension types are added by pandas and third-party libraries
Additionally, if the underlying memory representation of the returned array matters, we recommend specifying the dtype as a concrete object rather than a string alias or allowing it to be inferred. For example, a future version of pandas or a 3rd-party library may include a dedicated ExtensionArray for string data. In this event, the following would no longer return an arrays.PandasArray backed by a NumPy array.
>>> pd.array(['a', 'b'], dtype=str) <PandasArray> ['a', 'b'] Length: 2, dtype: str32
This would instead return the new ExtensionArray dedicated for string data. If you really need the new array to be backed by a NumPy array, specify that in the dtype.
>>> pd.array(['a', 'b'], dtype=np.dtype("<U1")) <PandasArray> ['a', 'b'] Length: 2, dtype: str32
Finally, pandas has arrays that mostly overlap with NumPy:
arrays.DatetimeArray
arrays.TimedeltaArray
When data with a datetime64[ns] or timedelta64[ns] dtype is passed, pandas will always return a DatetimeArray or TimedeltaArray rather than a PandasArray. This is for symmetry with the case of timezone-aware data, which NumPy does not natively support.
>>> pd.array(['2015', '2016'], dtype='datetime64[ns]') <DatetimeArray> ['2015-01-01 00:00:00', '2016-01-01 00:00:00'] Length: 2, dtype: datetime64[ns]
>>> pd.array(["1H", "2H"], dtype='timedelta64[ns]') <TimedeltaArray> ['0 days 01:00:00', '0 days 02:00:00'] Length: 2, dtype: timedelta64[ns]
Examples
If a dtype is not specified, pandas will infer the best dtype from the values. See the description of dtype for the types pandas infers for.
>>> pd.array([1, 2]) <IntegerArray> [1, 2] Length: 2, dtype: Int64
>>> pd.array([1, 2, np.nan]) <IntegerArray> [1, 2, <NA>] Length: 3, dtype: Int64
>>> pd.array([1.1, 2.2]) <FloatingArray> [1.1, 2.2] Length: 2, dtype: Float64
>>> pd.array(["a", None, "c"]) <StringArray> ['a', <NA>, 'c'] Length: 3, dtype: string
>>> with pd.option_context("string_storage", "pyarrow"): ... arr = pd.array(["a", None, "c"]) ... >>> arr <ArrowStringArray> ['a', <NA>, 'c'] Length: 3, dtype: string
>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")]) <PeriodArray> ['2000-01-01', '2000-01-01'] Length: 2, dtype: period[D]
You can use the string alias for dtype
>>> pd.array(['a', 'b', 'a'], dtype='category') ['a', 'b', 'a'] Categories (2, object): ['a', 'b']
Or specify the actual dtype
>>> pd.array(['a', 'b', 'a'], ... dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)) ['a', 'b', 'a'] Categories (3, object): ['a' < 'b' < 'c']
If pandas does not infer a dedicated extension type, an arrays.PandasArray is returned.
>>> pd.array([1 + 1j, 3 + 2j]) <PandasArray> [(1+1j), (3+2j)] Length: 2, dtype: complex128
As mentioned in the “Notes” section, new extension types may be added in the future (by pandas or 3rd party libraries), causing the return value to no longer be an arrays.PandasArray. Specify the dtype as a NumPy dtype if you need to ensure there’s no future change in behavior.
>>> pd.array([1, 2], dtype=np.dtype("int32")) <PandasArray> [1, 2] Length: 2, dtype: int32
data must be 1-dimensional. A ValueError is raised when the input has the wrong dimensionality.
>>> pd.array(1) Traceback (most recent call last): ... ValueError: Cannot pass scalar '1' to 'pandas.array'.
- pandas.bdate_range(start=None, end=None, periods=None, freq='B', tz=None, normalize=True, name=None, weekmask=None, holidays=None, inclusive='both', **kwargs)[source]
Return a fixed frequency DatetimeIndex with business day as the default.
- Parameters:
start (str or datetime-like, default None) – Left bound for generating dates.
end (str or datetime-like, default None) – Right bound for generating dates.
periods (int, default None) – Number of periods to generate.
freq (str, Timedelta, datetime.timedelta, or DateOffset, default 'B') – Frequency strings can have multiples, e.g. ‘5H’. The default is business daily (‘B’).
tz (str or None) – Time zone name for returning localized DatetimeIndex, for example 'Asia/Shanghai'.
normalize (bool, default True) – Normalize start/end dates to midnight before generating date range.
name (str, default None) – Name of the resulting DatetimeIndex.
weekmask (str or None, default None) – Weekmask of valid business days, passed to numpy.busdaycalendar, only used when custom frequency strings are passed. The default value None is equivalent to 'Mon Tue Wed Thu Fri'.
holidays (list-like or None, default None) – Dates to exclude from the set of valid business days, passed to numpy.busdaycalendar, only used when custom frequency strings are passed.
inclusive ({"both", "neither", "left", "right"}, default "both") –
Include boundaries; Whether to set each bound as closed or open.
New in version 1.4.0.
**kwargs – For compatibility. Has no effect on the result.
- Return type:
DatetimeIndex
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. Specifying freq is a requirement for bdate_range. Use date_range if specifying freq is not desired.
To learn more about the frequency strings, please see this link.
Examples
Note how the two weekend days are skipped in the result.
>>> pd.bdate_range(start='1/1/2018', end='1/08/2018') DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-08'], dtype='datetime64[ns]', freq='B')
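weekmask and holidays only take effect with a custom business-day frequency such as 'C'. A minimal sketch (2018-01-01 was a Monday; the output is the expected index, not verified):
>>> pd.bdate_range(start='1/1/2018', end='1/08/2018', freq='C', ... weekmask='Tue Wed Thu Fri', holidays=['2018-01-03']) DatetimeIndex(['2018-01-02', '2018-01-04', '2018-01-05'], dtype='datetime64[ns]', freq='C')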
- pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)[source]
Concatenate pandas objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
- Parameters:
objs (a sequence or mapping of Series or DataFrame objects) – If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
axis ({0/'index', 1/'columns'}, default 0) – The axis to concatenate along.
join ({'inner', 'outer'}, default 'outer') – How to handle indexes on other axis (or axes).
ignore_index (bool, default False) – If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
keys (sequence, default None) – If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
levels (list of sequences, default None) – Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
names (list, default None) – Names for the levels in the resulting hierarchical index.
verify_integrity (bool, default False) – Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
sort (bool, default False) – Sort non-concatenation axis if it is not already aligned.
copy (bool, default True) – If False, do not copy data unnecessarily.
- Returns:
When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.
- Return type:
object, type of objs
See also
DataFrame.join – Join DataFrames using indexes.
DataFrame.merge – Merge DataFrames by indexes or columns.
Notes
The keys, levels, and names arguments are all optional.
A walkthrough of how this method fits in with other tools for combining pandas objects can be found here.
It is not recommended to build DataFrames by adding single rows in a for loop. Build a list of rows and make a DataFrame in a single concat.
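A minimal sketch of the recommended pattern: accumulate the pieces in a list, then concatenate once.
>>> pieces = [pd.DataFrame({'a': [i]}) for i in range(3)] >>> pd.concat(pieces, ignore_index=True) a 0 0 1 1 2 2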
Examples
Combine two Series.
>>> s1 = pd.Series(['a', 'b']) >>> s2 = pd.Series(['c', 'd']) >>> pd.concat([s1, s2]) 0 a 1 b 0 c 1 d dtype: object
Clear the existing index and reset it in the result by setting the ignore_index option to True.
>>> pd.concat([s1, s2], ignore_index=True) 0 a 1 b 2 c 3 d dtype: object
Add a hierarchical index at the outermost level of the data with the keys option.
>>> pd.concat([s1, s2], keys=['s1', 's2']) s1 0 a 1 b s2 0 c 1 d dtype: object
Label the index keys you create with the names option.
>>> pd.concat([s1, s2], keys=['s1', 's2'], ... names=['Series name', 'Row ID']) Series name Row ID s1 0 a 1 b s2 0 c 1 d dtype: object
Combine two DataFrame objects with identical columns.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]], ... columns=['letter', 'number']) >>> df1 letter number 0 a 1 1 b 2 >>> df2 = pd.DataFrame([['c', 3], ['d', 4]], ... columns=['letter', 'number']) >>> df2 letter number 0 c 3 1 d 4 >>> pd.concat([df1, df2]) letter number 0 a 1 1 b 2 0 c 3 1 d 4
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.
>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], ... columns=['letter', 'number', 'animal']) >>> df3 letter number animal 0 c 3 cat 1 d 4 dog >>> pd.concat([df1, df3], sort=False) letter number animal 0 a 1 NaN 1 b 2 NaN 0 c 3 cat 1 d 4 dog
Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.
>>> pd.concat([df1, df3], join="inner") letter number 0 a 1 1 b 2 0 c 3 1 d 4
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], ... columns=['animal', 'name']) >>> pd.concat([df1, df4], axis=1) letter number animal name 0 a 1 bird polly 1 b 2 monkey george
Prevent the result from including duplicate index values with the verify_integrity option.
>>> df5 = pd.DataFrame([1], index=['a']) >>> df5 0 a 1 >>> df6 = pd.DataFrame([2], index=['a']) >>> df6 0 a 2 >>> pd.concat([df5, df6], verify_integrity=True) Traceback (most recent call last): ... ValueError: Indexes have overlapping values: ['a']
Append a single row to the end of a DataFrame object.
>>> df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0]) >>> df7 a b 0 1 2 >>> new_row = pd.Series({'a': 3, 'b': 4}) >>> new_row a 3 b 4 dtype: int64 >>> pd.concat([df7, new_row.to_frame().T], ignore_index=True) a b 0 1 2 1 3 4
- pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)[source]
Compute a simple cross tabulation of two (or more) factors.
By default, computes a frequency table of the factors unless an array of values and an aggregation function are passed.
- Parameters:
index (array-like, Series, or list of arrays/Series) – Values to group by in the rows.
columns (array-like, Series, or list of arrays/Series) – Values to group by in the columns.
values (array-like, optional) – Array of values to aggregate according to the factors. Requires aggfunc be specified.
rownames (sequence, default None) – If passed, must match number of row arrays passed.
colnames (sequence, default None) – If passed, must match number of column arrays passed.
aggfunc (function, optional) – If specified, requires values be specified as well.
margins (bool, default False) – Add row/column margins (subtotals).
margins_name (str, default 'All') – Name of the row/column that will contain the totals when margins is True.
dropna (bool, default True) – Do not include columns whose entries are all NaN.
normalize (bool, {'all', 'index', 'columns'}, or {0,1}, default False) –
Normalize by dividing all values by the sum of values.
If passed ‘all’ or True, will normalize over all values.
If passed ‘index’ will normalize over each row.
If passed ‘columns’ will normalize over each column.
If margins is True, will also normalize margin values.
- Returns:
Cross tabulation of the data.
- Return type:
DataFrame
See also
DataFrame.pivot – Reshape data based on column values.
pivot_table – Create a pivot table as a DataFrame.
Notes
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified.
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
In the event that there aren’t overlapping indexes, an empty DataFrame will be returned.
Reference the user guide for more examples.
Examples
>>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar", ... "bar", "bar", "foo", "foo", "foo"], dtype=object) >>> b = np.array(["one", "one", "one", "two", "one", "one", ... "one", "two", "two", "two", "one"], dtype=object) >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", ... "shiny", "dull", "shiny", "shiny", "shiny"], ... dtype=object) >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c']) b one two c dull shiny dull shiny a bar 1 2 1 0 foo 2 2 1 2
Here ‘c’ and ‘f’ are not represented in the data and will not be shown in the output because dropna is True by default. Set dropna=False to preserve categories with no data.
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c']) >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f']) >>> pd.crosstab(foo, bar) col_0 d e row_0 a 1 0 b 0 1 >>> pd.crosstab(foo, bar, dropna=False) col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0
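A minimal sketch of normalize, dividing each cell by its column total using the a and b arrays from above (the values shown are the expected ratios 3/7, 1/4, 4/7, 3/4; exact display not verified):
>>> pd.crosstab(a, b, normalize='columns') col_0 one two row_0 bar 0.428571 0.250000 foo 0.571429 0.750000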
- pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)[source]
Bin values into discrete intervals.
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
- Parameters:
x (array-like) – The input array to be binned. Must be 1-dimensional.
bins (int, sequence of scalars, or IntervalIndex) –
The criteria to bin by.
int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.
right (bool, default True) – Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.
labels (array or False, default None) – Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex. If True, raises an error. When ordered=False, labels must be provided.
retbins (bool, default False) – Whether to return the bins or not. Useful when bins is provided as a scalar.
precision (int, default 3) – The precision at which to store and display the bins labels.
include_lowest (bool, default False) – Whether the first interval should be left-inclusive or not.
duplicates ({default 'raise', 'drop'}, optional) – If bin edges are not unique, raise ValueError or drop non-uniques.
ordered (bool, default True) –
Whether the labels are ordered or not. Applies to returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered (labels must be provided).
New in version 1.1.0.
- Returns:
out (Categorical, Series, or ndarray) – An array-like object representing the respective bin for each value of x. The type depends on the value of labels.
None (default) : returns a Series for Series x or a Categorical for all other inputs. The values stored within are Interval dtype.
sequence of scalars : returns a Series for Series x or a Categorical for all other inputs. The values stored within are whatever the type in the sequence is.
False : returns an ndarray of integers.
bins (numpy.ndarray or IntervalIndex) – The computed or specified bins. Only returned when retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If duplicates='drop' is set, non-unique bin edges will be dropped. For an IntervalIndex bins, this is equal to bins.
See also
qcut – Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
Categorical – Array type for storing data that come from a fixed set of values.
Series – One-dimensional array with axis labels (including time series).
IntervalIndex – Immutable Index implementing an ordered, sliceable set.
Notes
Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or Categorical object.
Reference the user guide for more examples.
Examples
Discretize into three equal-sized bins.
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3) ... [(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ... Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True) ... ([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ... Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ... array([0.994, 3. , 5. , 7. ]))
Discretize into the same bins, but assign them specific labels. Notice that the returned Categorical’s categories are the labels and are ordered.
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), ... 3, labels=["bad", "medium", "good"]) ['bad', 'good', 'medium', 'medium', 'good', 'bad'] Categories (3, object): ['bad' < 'medium' < 'good']
ordered=False will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, ... labels=["B", "A", "B"], ordered=False) ['B', 'B', 'A', 'A', 'B', 'B'] Categories (2, object): ['A', 'B']
labels=False implies you just want the bins back.
>>> pd.cut([0, 1, 1, 2], bins=4, labels=False) array([0, 1, 1, 3])
Passing a Series as an input returns a Series with categorical dtype:
>>> s = pd.Series(np.array([2, 4, 6, 8, 10]), ... index=['a', 'b', 'c', 'd', 'e']) >>> pd.cut(s, 3) ... a (1.992, 4.667] b (1.992, 4.667] c (4.667, 7.333] d (7.333, 10.0] e (7.333, 10.0] dtype: category Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...
Passing a Series as input with labels=False returns a Series of bin indicators, mapping values numerically to intervals based on the bins.
>>> s = pd.Series(np.array([2, 4, 6, 8, 10]), ... index=['a', 'b', 'c', 'd', 'e']) >>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False) ... (a 1.0 b 2.0 c 3.0 d 4.0 e NaN dtype: float64, array([ 0, 2, 4, 6, 8, 10]))
Use duplicates='drop' when the bin edges are not unique:
>>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True, ... right=False, duplicates='drop') ... (a 1.0 b 2.0 c 3.0 d 3.0 e NaN dtype: float64, array([ 0, 2, 4, 6, 10]))
Passing an IntervalIndex for bins results in those categories exactly. Notice that values not covered by the IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5 falls between two bins.
>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)]) >>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins) [NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]] Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]
- pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, inclusive='both', *, unit=None, **kwargs)[source]
Return a fixed frequency DatetimeIndex.
Returns the range of equally spaced time points (where the difference between any two adjacent points is specified by the given frequency) such that they all satisfy start <[=] x <[=] end, where the first one and the last one are, resp., the first and last time points in that range that fall on the boundary of freq (if given as a frequency string) or that are valid for freq (if given as a pandas.tseries.offsets.DateOffset). (If exactly one of start, end, or freq is not specified, this missing parameter can be computed given periods, the number of timesteps in the range. See the note below.)
- Parameters:
start (str or datetime-like, optional) – Left bound for generating dates.
end (str or datetime-like, optional) – Right bound for generating dates.
periods (int, optional) – Number of periods to generate.
freq (str, datetime.timedelta, or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See here for a list of frequency aliases.
tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive unless timezone-aware datetime-likes are passed.
normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.
name (str, default None) – Name of the resulting DatetimeIndex.
inclusive ({"both", "neither", "left", "right"}, default "both") –
Include boundaries; Whether to set each bound as closed or open.
New in version 1.4.0.
unit (str, default None) –
Specify the desired resolution of the result.
New in version 2.0.0.
**kwargs – For compatibility. Has no effect on the result.
- Return type:
DatetimeIndex
See also
DatetimeIndex – An immutable container for datetimes.
timedelta_range – Return a fixed frequency TimedeltaIndex.
period_range – Return a fixed frequency PeriodIndex.
interval_range – Return a fixed frequency IntervalIndex.
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
Examples
Specifying the values
The next four examples generate the same DatetimeIndex, but vary the combination of start, end and periods.
Specify start and end, with the default daily frequency.
>>> pd.date_range(start='1/1/2018', end='1/08/2018') DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'], dtype='datetime64[ns]', freq='D')
Specify timezone-aware start and end, with the default daily frequency.
>>> pd.date_range( ... start=pd.to_datetime("1/1/2018").tz_localize("Europe/Berlin"), ... end=pd.to_datetime("1/08/2018").tz_localize("Europe/Berlin"), ... ) DatetimeIndex(['2018-01-01 00:00:00+01:00', '2018-01-02 00:00:00+01:00', '2018-01-03 00:00:00+01:00', '2018-01-04 00:00:00+01:00', '2018-01-05 00:00:00+01:00', '2018-01-06 00:00:00+01:00', '2018-01-07 00:00:00+01:00', '2018-01-08 00:00:00+01:00'], dtype='datetime64[ns, Europe/Berlin]', freq='D')
Specify start and periods, the number of periods (days).
>>> pd.date_range(start='1/1/2018', periods=8) DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'], dtype='datetime64[ns]', freq='D')
Specify end and periods, the number of periods (days).
>>> pd.date_range(end='1/1/2018', periods=8) DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'], dtype='datetime64[ns]', freq='D')
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3) DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00', '2018-04-27 00:00:00'], dtype='datetime64[ns]', freq=None)
Other Parameters
Changed the freq (frequency) to 'M' (month end frequency).
>>> pd.date_range(start='1/1/2018', periods=5, freq='M') DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30', '2018-05-31'], dtype='datetime64[ns]', freq='M')
Multiples are allowed
>>> pd.date_range(start='1/1/2018', periods=5, freq='3M') DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31', '2019-01-31'], dtype='datetime64[ns]', freq='3M')
freq can also be specified as an Offset object.
>>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3)) DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31', '2019-01-31'], dtype='datetime64[ns]', freq='3M')
Specify tz to set the timezone.
>>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo') DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00', '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00', '2018-01-05 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq='D')
inclusive controls whether to include start and end that are on the boundary. The default, “both”, includes boundary points on either end.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', inclusive="both") DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'], dtype='datetime64[ns]', freq='D')
Use inclusive='left' to exclude end if it falls on the boundary.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', inclusive='left') DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq='D')
Use inclusive='right' to exclude start if it falls on the boundary, and similarly inclusive='neither' will exclude both start and end.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', inclusive='right') DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'], dtype='datetime64[ns]', freq='D')
Specify a unit
>>> pd.date_range(start="2017-01-01", periods=10, freq="100AS", unit="s") DatetimeIndex(['2017-01-01', '2117-01-01', '2217-01-01', '2317-01-01', '2417-01-01', '2517-01-01', '2617-01-01', '2717-01-01', '2817-01-01', '2917-01-01'], dtype='datetime64[s]', freq='100AS-JAN')
- pandas.eval(expr, parser='pandas', engine=None, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)[source]
Evaluate a Python expression as a string using various backends.
The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only), along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
- Parameters:
expr (str) – The expression to evaluate. This string cannot contain any Python statements, only Python expressions.
parser ({'pandas', 'python'}, default 'pandas') – The parser to use to construct the syntax tree from the expression. The default of 'pandas' parses code slightly differently than standard Python. Alternatively, you can parse an expression using the 'python' parser to retain strict Python semantics. See the enhancing performance documentation for more details.
engine ({'python', 'numexpr'}, default 'numexpr') –
The engine used to evaluate the expression. Supported engines are
None : tries to use numexpr, falls back to python.
'numexpr' : This default engine evaluates pandas objects using numexpr for large speed ups in complex expressions with large frames.
'python' : Performs operations as if you had eval’d in top level python. This engine is generally not that useful.
More backends may be available in the future.
local_dict (dict or None, optional) – A dictionary of local variables, taken from locals() by default.
global_dict (dict or None, optional) – A dictionary of global variables, taken from globals() by default.
resolvers (list of dict-like or None, optional) – A list of objects implementing the __getitem__ special method that you can use to inject an additional collection of namespaces to use for variable lookup. For example, this is used in the query() method to inject the DataFrame.index and DataFrame.columns variables that refer to their respective DataFrame instance attributes.
level (int, optional) – The number of prior stack frames to traverse and add to the current scope. Most users will not need to change this parameter.
target (object, optional, default None) – This is the target object for assignment. It is used when there is variable assignment in the expression. If so, then target must support item assignment with string keys, and if a copy is being returned, it must also support .copy().
inplace (bool, default False) – If target is provided, and the expression mutates target, whether to modify target inplace. Otherwise, return a copy of target with the mutation.
- Returns:
The completion value of evaluating the given code, or None if inplace=True.
- Return type:
ndarray, numeric scalar, DataFrame, Series, or None
- Raises:
ValueError – There are many instances where such an error can be raised:
target=None, but the expression is multiline.
The expression is multiline, but not all of them have item assignment. An example of such an arrangement is this: a = b + 1; a + 2. Here, there are expressions on different lines, making it multiline, but the last line has no variable assigned to the output of a + 2.
inplace=True, but the expression is missing item assignment.
Item assignment is provided, but the target does not support string item assignment.
Item assignment is provided and inplace=False, but the target does not support the .copy() method.
See also
DataFrame.query – Evaluates a boolean expression to query the columns of a frame.
DataFrame.eval – Evaluate a string describing operations on DataFrame columns.
Notes
The dtype of any objects involved in an arithmetic % operation are recursively cast to float64.
See the enhancing performance documentation for more details.
Examples
>>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]}) >>> df animal age 0 dog 10 1 pig 20
We can add a new column using pd.eval:
>>> pd.eval("double_age = df.age * 2", target=df) animal age double_age 0 dog 10 20 1 pig 20 40
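With inplace=True and a target, the assignment mutates the target and None is returned. A minimal sketch continuing from df above (behavior assumed per the inplace description; df itself was not modified by the previous call):
>>> pd.eval("triple_age = df.age * 3", target=df, inplace=True) >>> df animal age triple_age 0 dog 10 30 1 pig 20 60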
- pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)[source]
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method Series.factorize() and Index.factorize().
- Parameters:
values (sequence) – A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.
sort (bool, default False) – Sort uniques and shuffle codes to maintain the relationship.
use_na_sentinel (bool, default True) –
If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and will not drop the NaN from the uniques of the values.
New in version 1.5.0.
size_hint (int, optional) – Hint to the hashtable sizer.
- Returns:
codes (ndarray) – An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.
uniques (ndarray, Index, or Categorical) – The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
Note
Even if there’s a missing value in values, uniques will not contain an entry for it.
Notes
Reference the user guide for more examples.
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b']) >>> codes array([0, 0, 1, 2, 0]) >>> uniques array(['b', 'a', 'c'], dtype=object)
With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True) >>> codes array([1, 1, 0, 2, 1]) >>> uniques array(['a', 'b', 'c'], dtype=object)
When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1 and missing values are not included in uniques.
>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b']) >>> codes array([ 0, -1, 1, 2, 0]) >>> uniques array(['b', 'a', 'c'], dtype=object)
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c']) >>> codes, uniques = pd.factorize(cat) >>> codes array([0, 0, 1]) >>> uniques ['a', 'c'] Categories (3, object): ['a', 'b', 'c']
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c']) >>> codes, uniques = pd.factorize(cat) >>> codes array([0, 0, 1]) >>> uniques Index(['a', 'c'], dtype='object')
If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting use_na_sentinel=False.
>>> values = np.array([1, 2, 1, np.nan]) >>> codes, uniques = pd.factorize(values) # default: use_na_sentinel=True >>> codes array([ 0, 1, 0, -1]) >>> uniques array([1., 2.])
>>> codes, uniques = pd.factorize(values, use_na_sentinel=False) >>> codes array([0, 1, 0, 2]) >>> uniques array([ 1., 2., nan])
- pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)[source]
Convert categorical variable into dummy/indicator variables.
Each variable is converted into as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.
- Parameters:
data (array-like, Series, or DataFrame) – Data of which to get dummy indicators.
prefix (str, list of str, or dict of str, default None) – String to prepend to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep (str, default '_') – If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na (bool, default False) – Add a column to indicate NaNs, if False NaNs are ignored.
columns (list-like, default None) – Column names in the DataFrame to be encoded. If columns is None then all the columns with object, string, or category dtype will be converted.
sparse (bool, default False) – Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
drop_first (bool, default False) – Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype (dtype, default bool) – Data type for new columns. Only a single dtype is allowed.
- Returns:
Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.
- Return type:
DataFrame
See also
Series.str.get_dummies – Convert Series of strings to dummy codes.
from_dummies() – Convert dummy codes to categorical DataFrame.
Notes
Reference the user guide for more examples.
Examples
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s) a b c 0 True False False 1 False True False 2 False False True 3 True False False
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1) a b 0 True False 1 False True 2 False False
>>> pd.get_dummies(s1, dummy_na=True) a b NaN 0 True False False 1 False True False 2 False False True
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], ... 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2']) C col1_a col1_b col2_a col2_b col2_c 0 1 True False False True False 1 2 False True True False False 2 3 True False False False True
>>> pd.get_dummies(pd.Series(list('abcaa'))) a b c 0 True False False 1 False True False 2 False False True 3 True False False 4 True False False
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True) b c 0 False False 1 True False 2 False True 3 False False 4 False False
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float) a b c 0 1.0 0.0 0.0 1 0.0 1.0 0.0 2 0.0 0.0 1.0
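prefix can also be a dict mapping column names to prefixes. A minimal sketch reusing df from above (output assumed to mirror the list form):
>>> pd.get_dummies(df, prefix={'A': 'colA', 'B': 'colB'}) C colA_a colA_b colB_a colB_b colB_c 0 1 True False False True False 1 2 False True True False False 2 3 True False False False True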
- pandas.from_dummies(data, sep=None, default_category=None)[source]
Create a categorical DataFrame from a DataFrame of dummy variables.
Inverts the operation performed by get_dummies().
New in version 1.5.0.
- Parameters:
data (DataFrame) – Data which contains dummy-coded variables in form of integer columns of 1’s and 0’s.
sep (str, default None) – Separator used in the column names of the dummy categories; the character indicating the separation of the categorical names from the prefixes. For example, if your column names are ‘prefix_A’ and ‘prefix_B’, you can strip the underscore by specifying sep='_'.
default_category (None, Hashable or dict of Hashables, default None) – The default category is the implied category when a value has none of the listed categories specified with a one, i.e. if all dummies in a row are zero. Can be a single value for all variables or a dict directly mapping the default categories to a prefix of a variable.
- Returns:
Categorical data decoded from the dummy input-data.
- Return type:
DataFrame
- Raises:
ValueError –
When the input DataFrame data contains NA values.
When the input DataFrame data contains column names with separators that do not match the separator specified with sep.
When a dict passed to default_category does not include an implied category for each prefix.
When a value in data has more than one category assigned to it.
When default_category=None and a value in data has no category assigned to it.
TypeError –
When the input data is not of type DataFrame.
When the input DataFrame data contains non-dummy data.
When the passed sep is of a wrong data type.
When the passed default_category is of a wrong data type.
See also
get_dummies() – Convert Series or DataFrame to dummy codes.
Categorical – Represent a categorical variable in classic R / S-plus fashion.
Notes
The columns of the passed dummy data should only include 1’s and 0’s, or boolean values.
Examples
>>> df = pd.DataFrame({"a": [1, 0, 0, 1], "b": [0, 1, 0, 0], ... "c": [0, 0, 1, 0]})
>>> df a b c 0 1 0 0 1 0 1 0 2 0 0 1 3 1 0 0
>>> pd.from_dummies(df) 0 a 1 b 2 c 3 a
>>> df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], ... "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], ... "col2_c": [0, 0, 1]})
>>> df col1_a col1_b col2_a col2_b col2_c 0 1 0 0 1 0 1 0 1 1 0 0 2 1 0 0 0 1
>>> pd.from_dummies(df, sep="_") col1 col2 0 a b 1 b a 2 a c
>>> df = pd.DataFrame({"col1_a": [1, 0, 0], "col1_b": [0, 1, 0], ... "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], ... "col2_c": [0, 0, 0]})
>>> df col1_a col1_b col2_a col2_b col2_c 0 1 0 0 1 0 1 0 1 1 0 0 2 0 0 0 0 0
>>> pd.from_dummies(df, sep="_", default_category={"col1": "d", "col2": "e"}) col1 col2 0 a b 1 b a 2 d e
- pandas.infer_freq(index)[source]
Infer the most likely frequency given the input index.
- Parameters:
index (DatetimeIndex or TimedeltaIndex) – If a Series is passed, the values of the Series are used (not the index).
- Returns:
None if no discernible frequency.
- Return type:
str or None
- Raises:
TypeError – If the index is not datetime-like.
ValueError – If there are fewer than three values.
Examples
>>> idx = pd.date_range(start='2020/12/01', end='2020/12/30', periods=30) >>> pd.infer_freq(idx) 'D'
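Multiples are reflected in the inferred alias. A minimal sketch (output assumed):
>>> idx = pd.date_range(start='2020-01-01', periods=5, freq='2H') >>> pd.infer_freq(idx) '2H'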
- pandas.interval_range(start=None, end=None, periods=None, freq=None, name=None, closed='right')[source]
Return a fixed frequency IntervalIndex.
- Parameters:
start (numeric or datetime-like, default None) – Left bound for generating intervals.
end (numeric or datetime-like, default None) – Right bound for generating intervals.
periods (int, default None) – Number of periods to generate.
freq (numeric, str, datetime.timedelta, or DateOffset, default None) – The length of each interval. Must be consistent with the type of start and end, e.g. 2 for numeric, or ‘5H’ for datetime-like. Default is 1 for numeric and ‘D’ for datetime-like.
name (str, default None) – Name of the resulting IntervalIndex.
closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.
- Return type:
IntervalIndex
See also
IntervalIndex – An Index of intervals that are all closed on the same side.
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting IntervalIndex will have periods linearly spaced elements between start and end, inclusively.
To learn more about datetime-like frequency strings, please see this link.
Examples
Numeric start and end is supported.
>>> pd.interval_range(start=0, end=5) IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
Additionally, datetime-like input is also supported.
>>> pd.interval_range(start=pd.Timestamp('2017-01-01'), ... end=pd.Timestamp('2017-01-04')) IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04]], dtype='interval[datetime64[ns], right]')
The freq parameter specifies the frequency between the left and right endpoints of the individual intervals within the IntervalIndex. For numeric start and end, the frequency must also be numeric.
>>> pd.interval_range(start=0, periods=4, freq=1.5) IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')
Similarly, for datetime-like start and end, the frequency must be convertible to a DateOffset.
>>> pd.interval_range(start=pd.Timestamp('2017-01-01'), ... periods=3, freq='MS') IntervalIndex([(2017-01-01, 2017-02-01], (2017-02-01, 2017-03-01], (2017-03-01, 2017-04-01]], dtype='interval[datetime64[ns], right]')
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
>>> pd.interval_range(start=0, end=6, periods=4) IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')
The closed parameter specifies which endpoints of the individual intervals within the IntervalIndex are closed.
>>> pd.interval_range(end=5, periods=4, closed='both') IntervalIndex([[1, 2], [2, 3], [3, 4], [4, 5]], dtype='interval[int64, both]')
- pandas.isna(obj)[source]
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
- Parameters:
obj (scalar or array-like) – Object to check for null or missing values.
- Returns:
For scalar input, returns a scalar boolean. For array input, returns an array of boolean indicating whether each corresponding element is missing.
- Return type:
bool or array-like of bool
See also
notna – Boolean inverse of pandas.isna.
Series.isna – Detect missing values in a Series.
DataFrame.isna – Detect missing values in a DataFrame.
Index.isna – Detect missing values in an Index.
Examples
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.isna('dog') False
>>> pd.isna(pd.NA) True
>>> pd.isna(np.nan) True
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]]) >>> array array([[ 1., nan, 3.], [ 4., 5., nan]]) >>> pd.isna(array) array([[False, True, False], [False, False, True]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None, ... "2017-07-08"]) >>> index DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'], dtype='datetime64[ns]', freq=None) >>> pd.isna(index) array([False, False, True, False])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']]) >>> df 0 1 2 0 ant bee cat 1 dog None fly >>> pd.isna(df) 0 1 2 0 False False False 1 False True False
>>> pd.isna(df[1]) 0 False 1 True Name: 1, dtype: bool
- pandas.isnull(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
- Parameters:
obj (scalar or array-like) – Object to check for null or missing values.
- Returns:
For scalar input, returns a scalar boolean. For array input, returns an array of boolean indicating whether each corresponding element is missing.
- Return type:
bool or array-like of bool
See also
notna – Boolean inverse of pandas.isna.
Series.isna – Detect missing values in a Series.
DataFrame.isna – Detect missing values in a DataFrame.
Index.isna – Detect missing values in an Index.
Examples
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.isna('dog') False
>>> pd.isna(pd.NA) True
>>> pd.isna(np.nan) True
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]]) >>> array array([[ 1., nan, 3.], [ 4., 5., nan]]) >>> pd.isna(array) array([[False, True, False], [False, False, True]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None, ... "2017-07-08"]) >>> index DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'], dtype='datetime64[ns]', freq=None) >>> pd.isna(index) array([False, False, True, False])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']]) >>> df 0 1 2 0 ant bee cat 1 dog None fly >>> pd.isna(df) 0 1 2 0 False False False 1 False True False
>>> pd.isna(df[1]) 0 False 1 True Name: 1, dtype: bool
- pandas.json_normalize(data, record_path=None, meta=None, meta_prefix=None, record_prefix=None, errors='raise', sep='.', max_level=None)[source]
Normalize semi-structured JSON data into a flat table.
- Parameters:
data (dict or list of dicts) – Unserialized JSON objects.
record_path (str or list of str, default None) – Path in each object to list of records. If not passed, data will be assumed to be an array of records.
meta (list of paths (str or list of str), default None) – Fields to use as metadata for each record in resulting table.
meta_prefix (str, default None) – If not None, prefix records with dotted path, e.g. foo.bar.field if meta is ['foo', 'bar'].
record_prefix (str, default None) – If not None, prefix records with dotted path, e.g. foo.bar.field if path to records is ['foo', 'bar'].
errors ({'raise', 'ignore'}, default 'raise') –
Configures error handling.
’ignore’ : will ignore KeyError if keys listed in meta are not always present.
’raise’ : will raise KeyError if keys listed in meta are not always present.
sep (str, default '.') – Nested records will generate names separated by sep. e.g., for sep=’.’, {‘foo’: {‘bar’: 0}} -> foo.bar.
max_level (int, default None) – Max number of levels (depth of dict) to normalize. If None, normalizes all levels.
- Returns:
frame (DataFrame)
Normalize semi-structured JSON data into a flat table.
- Return type:
DataFrame
Examples
>>> data = [ ... {"id": 1, "name": {"first": "Coleen", "last": "Volk"}}, ... {"name": {"given": "Mark", "family": "Regner"}}, ... {"id": 2, "name": "Faye Raker"}, ... ] >>> pd.json_normalize(data) id name.first name.last name.given name.family name 0 1.0 Coleen Volk NaN NaN NaN 1 NaN NaN NaN Mark Regner NaN 2 2.0 NaN NaN NaN NaN Faye Raker
>>> data = [ ... { ... "id": 1, ... "name": "Cole Volk", ... "fitness": {"height": 130, "weight": 60}, ... }, ... {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}}, ... { ... "id": 2, ... "name": "Faye Raker", ... "fitness": {"height": 130, "weight": 60}, ... }, ... ] >>> pd.json_normalize(data, max_level=0) id name fitness 0 1.0 Cole Volk {'height': 130, 'weight': 60} 1 NaN Mark Reg {'height': 130, 'weight': 60} 2 2.0 Faye Raker {'height': 130, 'weight': 60}
Normalizes nested data up to level 1.
>>> data = [ ... { ... "id": 1, ... "name": "Cole Volk", ... "fitness": {"height": 130, "weight": 60}, ... }, ... {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}}, ... { ... "id": 2, ... "name": "Faye Raker", ... "fitness": {"height": 130, "weight": 60}, ... }, ... ] >>> pd.json_normalize(data, max_level=1) id name fitness.height fitness.weight 0 1.0 Cole Volk 130 60 1 NaN Mark Reg 130 60 2 2.0 Faye Raker 130 60
>>> data = [ ... { ... "state": "Florida", ... "shortname": "FL", ... "info": {"governor": "Rick Scott"}, ... "counties": [ ... {"name": "Dade", "population": 12345}, ... {"name": "Broward", "population": 40000}, ... {"name": "Palm Beach", "population": 60000}, ... ], ... }, ... { ... "state": "Ohio", ... "shortname": "OH", ... "info": {"governor": "John Kasich"}, ... "counties": [ ... {"name": "Summit", "population": 1234}, ... {"name": "Cuyahoga", "population": 1337}, ... ], ... }, ... ] >>> result = pd.json_normalize( ... data, "counties", ["state", "shortname", ["info", "governor"]] ... ) >>> result name population state shortname info.governor 0 Dade 12345 Florida FL Rick Scott 1 Broward 40000 Florida FL Rick Scott 2 Palm Beach 60000 Florida FL Rick Scott 3 Summit 1234 Ohio OH John Kasich 4 Cuyahoga 1337 Ohio OH John Kasich
>>> data = {"A": [1, 2]} >>> pd.json_normalize(data, "A", record_prefix="Prefix.") Prefix.0 0 1 1 2
Returns normalized data with columns prefixed with the given string.
- pandas.lreshape(data, groups, dropna=True)[source]
Reshape wide-format data to long. Generalized inverse of DataFrame.pivot.
Accepts a dictionary, groups, in which each key is a new column name and each value is a list of old column names that will be “melted” under the new column name as part of the reshape.
- Parameters:
data (DataFrame) – The wide-format DataFrame.
groups (dict) – {new_name : list_of_columns}.
dropna (bool, default True) – Do not include columns whose entries are all NaN.
- Returns:
Reshaped DataFrame.
- Return type:
DataFrame
See also
meltUnpivot a DataFrame from wide to long format, optionally leaving identifiers set.
pivotCreate a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivotPivot without aggregation that can handle non-numeric data.
DataFrame.pivot_tableGeneralization of pivot that can handle duplicate values for one index/column pair.
DataFrame.unstackPivot based on the index values instead of a column.
wide_to_longWide panel to long format. Less flexible but more user-friendly than melt.
Examples
>>> data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526], ... 'team': ['Red Sox', 'Yankees'], ... 'year1': [2007, 2007], 'year2': [2008, 2008]}) >>> data hr1 hr2 team year1 year2 0 514 545 Red Sox 2007 2008 1 573 526 Yankees 2007 2008
>>> pd.lreshape(data, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']}) team year hr 0 Red Sox 2007 514 1 Yankees 2007 573 2 Red Sox 2008 545 3 Yankees 2008 526
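For intuition, the reshape above is roughly equivalent to selecting and renaming each group by hand and concatenating the pieces; a sketch of that equivalence (an illustration, not the library's implementation):
>>> pd.concat(
...     [data[['team', 'year1', 'hr1']].rename(columns={'year1': 'year', 'hr1': 'hr'}),
...      data[['team', 'year2', 'hr2']].rename(columns={'year2': 'year', 'hr2': 'hr'})],
...     ignore_index=True,
... )
      team  year   hr
0  Red Sox  2007  514
1  Yankees  2007  573
2  Red Sox  2008  545
3  Yankees  2008  526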
- pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)[source]
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
frame (DataFrame) –
id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name (scalar) – Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name (scalar, default 'value') – Name to use for the ‘value’ column.
col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.
ignore_index (bool, default True) –
If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.
New in version 1.1.0.
- Returns:
Unpivoted DataFrame.
- Return type:
DataFrame
See also
DataFrame.meltIdentical method.
pivot_tableCreate a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivotReturn reshaped DataFrame organized by given index / column values.
DataFrame.explodeExplode a DataFrame from list-like columns to long format.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'}, ... 'B': {0: 1, 1: 3, 2: 5}, ... 'C': {0: 2, 1: 4, 2: 6}}) >>> df A B C 0 a 1 2 1 b 3 4 2 c 5 6
>>> pd.melt(df, id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5
>>> pd.melt(df, id_vars=['A'], value_vars=['B', 'C']) A variable value 0 a B 1 1 b B 3 2 c B 5 3 a C 2 4 b C 4 5 c C 6
The names of ‘variable’ and ‘value’ columns can be customized:
>>> pd.melt(df, id_vars=['A'], value_vars=['B'], ... var_name='myVarname', value_name='myValname') A myVarname myValname 0 a B 1 1 b B 3 2 c B 5
Original index values can be kept around:
>>> pd.melt(df, id_vars=['A'], value_vars=['B', 'C'], ignore_index=False) A variable value 0 a B 1 1 b B 3 2 c B 5 0 a C 2 1 b C 4 2 c C 6
If you have multi-index columns:
>>> df.columns = [list('ABC'), list('DEF')] >>> df A B C D E F 0 a 1 2 1 b 3 4 2 c 5 6
>>> pd.melt(df, col_level=0, id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5
>>> pd.melt(df, id_vars=[('A', 'D')], value_vars=[('B', 'E')]) (A, D) variable_0 variable_1 value 0 a B E 1 1 b B E 3 2 c B E 5
- pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)[source]
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
- Parameters:
left (DataFrame or named Series) –
right (DataFrame or named Series) – Object to merge with.
how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
New in version 1.2.0.
on (label or list) – Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on (label or list, or array-like) – Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_on (label or list, or array-like) – Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_index (bool, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index (bool, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index.
sort (bool, default False) – Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
copy (bool, default True) – If False, avoid copy if possible.
indicator (bool or str, default False) – If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.
validate (str, optional) –
If specified, checks if merge is of specified type.
”one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
”one_to_many” or “1:m”: check if merge keys are unique in left dataset.
”many_to_one” or “m:1”: check if merge keys are unique in right dataset.
”many_to_many” or “m:m”: allowed, but does not result in checks.
- Returns:
A DataFrame of the two merged objects.
- Return type:
DataFrame
See also
merge_orderedMerge with optional filling/interpolation.
merge_asofMerge on nearest keys.
DataFrame.joinSimilar method using indices.
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0. Support for merging named Series objects was added in version 0.24.0.
Examples
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [1, 2, 3, 5]}) >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [5, 6, 7, 8]}) >>> df1 lkey value 0 foo 1 1 bar 2 2 baz 3 3 foo 5 >>> df2 rkey value 0 foo 5 1 bar 6 2 baz 7 3 foo 8
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.
>>> df1.merge(df2, left_on='lkey', right_on='rkey') lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 1 foo 8 2 foo 5 foo 5 3 foo 5 foo 8 4 bar 2 bar 6 5 baz 3 baz 7
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', ... suffixes=('_left', '_right')) lkey value_left rkey value_right 0 foo 1 foo 5 1 foo 1 foo 8 2 foo 5 foo 5 3 foo 5 foo 8 4 bar 2 bar 6 5 baz 3 baz 7
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False)) Traceback (most recent call last): ... ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]}) >>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]}) >>> df1 a b 0 foo 1 1 bar 2 >>> df2 a c 0 foo 3 1 baz 4
>>> df1.merge(df2, how='inner', on='a') a b c 0 foo 1 3
>>> df1.merge(df2, how='left', on='a') a b c 0 foo 1 3.0 1 bar 2 NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']}) >>> df2 = pd.DataFrame({'right': [7, 8]}) >>> df1 left 0 foo 1 bar >>> df2 right 0 7 1 8
>>> df1.merge(df2, how='cross') left right 0 foo 7 1 foo 8 2 bar 7 3 bar 8
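The indicator and validate parameters described above are not exercised in these examples; a brief sketch using small one-key frames:
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> # validate='1:1' passes because 'a' is unique on both sides;
>>> # indicator=True adds a categorical '_merge' provenance column.
>>> df1.merge(df2, how='outer', on='a', indicator=True, validate='1:1')
     a    b    c      _merge
0  bar  2.0  NaN   left_only
1  baz  NaN  4.0  right_only
2  foo  1.0  3.0        both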
- pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')[source]
Perform a merge by key distance.
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
The default is “backward”, which matches the behavior of versions before 0.20.0; the direction parameter was added in version 0.20.0 and introduces “forward” and “nearest”.
Optionally match on equivalent keys with ‘by’ before searching with ‘on’.
- Parameters:
left (DataFrame or named Series) –
right (DataFrame or named Series) –
on (label) – Field name to join on. Must be found in both DataFrames. The data MUST be ordered. Furthermore this must be a numeric column, such as datetimelike, integer, or float. On or left_on/right_on must be given.
left_on (label) – Field name to join on in left DataFrame.
right_on (label) – Field name to join on in right DataFrame.
left_index (bool) – Use the index of the left DataFrame as the join key.
right_index (bool) – Use the index of the right DataFrame as the join key.
by (column name or list of column names) – Match on these columns before performing merge operation.
left_by (column name) – Field names to match on in the left DataFrame.
right_by (column name) – Field names to match on in the right DataFrame.
suffixes (2-length sequence (tuple, list, ...)) – Suffix to apply to overlapping column names in the left and right side, respectively.
tolerance (int or Timedelta, optional, default None) – Select asof tolerance within this range; must be compatible with the merge index.
allow_exact_matches (bool, default True) –
If True, allow matching with the same ‘on’ value (i.e. less-than-or-equal-to / greater-than-or-equal-to)
If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-than).
direction ('backward' (default), 'forward', or 'nearest') – Whether to search for prior, subsequent, or closest matches.
- Return type:
DataFrame
See also
mergeMerge with a database-style join.
merge_orderedMerge with optional filling/interpolation.
Examples
>>> left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]}) >>> left a left_val 0 1 a 1 5 b 2 10 c
>>> right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]}) >>> right a right_val 0 1 1 1 2 2 2 3 3 3 6 6 4 7 7
>>> pd.merge_asof(left, right, on="a") a left_val right_val 0 1 a 1 1 5 b 3 2 10 c 7
>>> pd.merge_asof(left, right, on="a", allow_exact_matches=False) a left_val right_val 0 1 a NaN 1 5 b 3.0 2 10 c 7.0
>>> pd.merge_asof(left, right, on="a", direction="forward") a left_val right_val 0 1 a 1.0 1 5 b 6.0 2 10 c NaN
>>> pd.merge_asof(left, right, on="a", direction="nearest") a left_val right_val 0 1 a 1 1 5 b 6 2 10 c 7
We can use indexed DataFrames as well.
>>> left = pd.DataFrame({"left_val": ["a", "b", "c"]}, index=[1, 5, 10]) >>> left left_val 1 a 5 b 10 c
>>> right = pd.DataFrame({"right_val": [1, 2, 3, 6, 7]}, index=[1, 2, 3, 6, 7]) >>> right right_val 1 1 2 2 3 3 6 6 7 7
>>> pd.merge_asof(left, right, left_index=True, right_index=True) left_val right_val 1 a 1 5 b 3 10 c 7
Here is a real-world time-series example
>>> quotes = pd.DataFrame( ... { ... "time": [ ... pd.Timestamp("2016-05-25 13:30:00.023"), ... pd.Timestamp("2016-05-25 13:30:00.023"), ... pd.Timestamp("2016-05-25 13:30:00.030"), ... pd.Timestamp("2016-05-25 13:30:00.041"), ... pd.Timestamp("2016-05-25 13:30:00.048"), ... pd.Timestamp("2016-05-25 13:30:00.049"), ... pd.Timestamp("2016-05-25 13:30:00.072"), ... pd.Timestamp("2016-05-25 13:30:00.075") ... ], ... "ticker": [ ... "GOOG", ... "MSFT", ... "MSFT", ... "MSFT", ... "GOOG", ... "AAPL", ... "GOOG", ... "MSFT" ... ], ... "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01], ... "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03] ... } ... ) >>> quotes time ticker bid ask 0 2016-05-25 13:30:00.023 GOOG 720.50 720.93 1 2016-05-25 13:30:00.023 MSFT 51.95 51.96 2 2016-05-25 13:30:00.030 MSFT 51.97 51.98 3 2016-05-25 13:30:00.041 MSFT 51.99 52.00 4 2016-05-25 13:30:00.048 GOOG 720.50 720.93 5 2016-05-25 13:30:00.049 AAPL 97.99 98.01 6 2016-05-25 13:30:00.072 GOOG 720.50 720.88 7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
>>> trades = pd.DataFrame( ... { ... "time": [ ... pd.Timestamp("2016-05-25 13:30:00.023"), ... pd.Timestamp("2016-05-25 13:30:00.038"), ... pd.Timestamp("2016-05-25 13:30:00.048"), ... pd.Timestamp("2016-05-25 13:30:00.048"), ... pd.Timestamp("2016-05-25 13:30:00.048") ... ], ... "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"], ... "price": [51.95, 51.95, 720.77, 720.92, 98.0], ... "quantity": [75, 155, 100, 100, 100] ... } ... ) >>> trades time ticker price quantity 0 2016-05-25 13:30:00.023 MSFT 51.95 75 1 2016-05-25 13:30:00.038 MSFT 51.95 155 2 2016-05-25 13:30:00.048 GOOG 720.77 100 3 2016-05-25 13:30:00.048 GOOG 720.92 100 4 2016-05-25 13:30:00.048 AAPL 98.00 100
By default we are taking the asof of the quotes
>>> pd.merge_asof(trades, quotes, on="time", by="ticker") time ticker price quantity bid ask 0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96 1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 2ms between the quote time and the trade time
>>> pd.merge_asof( ... trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms") ... ) time ticker price quantity bid ask 0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96 1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 10ms between the quote time and the trade time, and we exclude exact matches on time. However, prior data will propagate forward
>>> pd.merge_asof( ... trades, ... quotes, ... on="time", ... by="ticker", ... tolerance=pd.Timedelta("10ms"), ... allow_exact_matches=False ... ) time ticker price quantity bid ask 0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN 1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98 2 2016-05-25 13:30:00.048 GOOG 720.77 100 NaN NaN 3 2016-05-25 13:30:00.048 GOOG 720.92 100 NaN NaN 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
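tolerance applies to the integer key from the first examples as well; a brief sketch:
>>> left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
>>> right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})
>>> # a=10 gets NaN: the nearest prior key (7) is more than 2 away
>>> pd.merge_asof(left, right, on="a", tolerance=2)
    a left_val  right_val
0   1        a        1.0
1   5        b        3.0
2  10        c        NaN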
- pandas.merge_ordered(left, right, on=None, left_on=None, right_on=None, left_by=None, right_by=None, fill_method=None, suffixes=('_x', '_y'), how='outer')[source]
Perform a merge for ordered data with optional filling/interpolation.
Designed for ordered data like time series data. Optionally perform group-wise merge (see examples).
- Parameters:
left (DataFrame or named Series) –
right (DataFrame or named Series) –
on (label or list) – Field names to join on. Must be found in both DataFrames.
left_on (label or list, or array-like) – Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns.
right_on (label or list, or array-like) – Field names to join on in right DataFrame or vector/list of vectors per left_on docs.
left_by (column name or list of column names) – Group left DataFrame by group columns and merge piece by piece with right DataFrame. Must be None if either left or right are a Series.
right_by (column name or list of column names) – Group right DataFrame by group columns and merge piece by piece with left DataFrame. Must be None if either left or right are a Series.
fill_method ({'ffill', None}, default None) – Interpolation method for data.
suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
how ({'left', 'right', 'outer', 'inner'}, default 'outer') –
left: use only keys from left frame (SQL: left outer join)
right: use only keys from right frame (SQL: right outer join)
outer: use union of keys from both frames (SQL: full outer join)
inner: use intersection of keys from both frames (SQL: inner join).
- Returns:
The merged DataFrame output type will be the same as ‘left’, if it is a subclass of DataFrame.
- Return type:
DataFrame
See also
mergeMerge with a database-style join.
merge_asofMerge on nearest keys.
Examples
>>> from pandas import merge_ordered >>> df1 = pd.DataFrame( ... { ... "key": ["a", "c", "e", "a", "c", "e"], ... "lvalue": [1, 2, 3, 1, 2, 3], ... "group": ["a", "a", "a", "b", "b", "b"] ... } ... ) >>> df1 key lvalue group 0 a 1 a 1 c 2 a 2 e 3 a 3 a 1 b 4 c 2 b 5 e 3 b
>>> df2 = pd.DataFrame({"key": ["b", "c", "d"], "rvalue": [1, 2, 3]}) >>> df2 key rvalue 0 b 1 1 c 2 2 d 3
>>> merge_ordered(df1, df2, fill_method="ffill", left_by="group") key lvalue group rvalue 0 a 1 a NaN 1 b 1 a 1.0 2 c 2 a 2.0 3 d 2 a 3.0 4 e 3 a 3.0 5 a 1 b NaN 6 b 1 b 1.0 7 c 2 b 2.0 8 d 2 b 3.0 9 e 3 b 3.0
- pandas.notna(obj)[source]
Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
- Parameters:
obj (array-like or object value) – Object to check for not null or non-missing values.
- Returns:
For scalar input, returns a scalar boolean. For array input, returns an array of boolean indicating whether each corresponding element is valid.
- Return type:
bool or array-like of bool
See also
isnaBoolean inverse of pandas.notna.
Series.notnaDetect valid values in a Series.
DataFrame.notnaDetect valid values in a DataFrame.
Index.notnaDetect valid values in an Index.
Examples
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.notna('dog') True
>>> pd.notna(pd.NA) False
>>> pd.notna(np.nan) False
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]]) >>> array array([[ 1., nan, 3.], [ 4., 5., nan]]) >>> pd.notna(array) array([[ True, False, True], [ True, True, False]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None, ... "2017-07-08"]) >>> index DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'], dtype='datetime64[ns]', freq=None) >>> pd.notna(index) array([ True, True, False, True])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']]) >>> df 0 1 2 0 ant bee cat 1 dog None fly >>> pd.notna(df) 0 1 2 0 True True True 1 True False True
>>> pd.notna(df[1]) 0 True 1 False Name: 1, dtype: bool
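Since notna is the boolean inverse of isna, negating one yields the other; a quick check with the df above:
>>> (pd.notna(df) == ~pd.isna(df)).all().all()
True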
- pandas.notnull(obj)
Detect non-missing values for an array-like object. pandas.notnull is an alias of pandas.notna; see notna above for the full parameter, return, and example documentation.
- class pandas.option_context[source]
Context manager to temporarily set options in the with statement context.
You need to invoke it as option_context(pat, val, [(pat, val), ...]).
Examples
>>> from pandas import option_context >>> with option_context('display.max_rows', 10, 'display.max_columns', 5): ... pass
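Options are restored to their previous values when the with block exits; a brief sketch:
>>> pd.set_option('display.max_rows', 60)
>>> with option_context('display.max_rows', 5):
...     print(pd.get_option('display.max_rows'))
5
>>> pd.get_option('display.max_rows')
60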
- pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)[source]
Return a fixed frequency PeriodIndex.
The day (calendar) is the default frequency.
- Parameters:
start (str or period-like, default None) – Left bound for generating periods.
end (str or period-like, default None) – Right bound for generating periods.
periods (int, default None) – Number of periods to generate.
freq (str or DateOffset, optional) – Frequency alias. By default the freq is taken from start or end if those are Period objects. Otherwise, the default is "D" for daily frequency.
name (str, default None) – Name of the resulting PeriodIndex.
- Return type:
PeriodIndex
Notes
Of the three parameters start, end, and periods, exactly two must be specified.
To learn more about the frequency strings, please see this link.
Examples
>>> pd.period_range(start='2017-01-01', end='2018-01-01', freq='M') PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06', '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01'], dtype='period[M]')
If start or end are Period objects, they will be used as anchor endpoints for a PeriodIndex with frequency matching that of the period_range constructor.
>>> pd.period_range(start=pd.Period('2017Q1', freq='Q'), ... end=pd.Period('2017Q2', freq='Q'), freq='M') PeriodIndex(['2017-03', '2017-04', '2017-05', '2017-06'], dtype='period[M]')
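Per the note above, periods can stand in for either endpoint; a brief sketch:
>>> pd.period_range(start='2017-01-01', periods=3, freq='M')
PeriodIndex(['2017-01', '2017-02', '2017-03'], dtype='period[M]')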
- pandas.pivot(data, *, columns, index=_NoDefault.no_default, values=_NoDefault.no_default)[source]
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation; multiple values will result in a MultiIndex in the columns. See the User Guide for more on reshaping.
- Parameters:
data (DataFrame) –
columns (str or object or a list of str) –
Column to use to make new frame’s columns.
Changed in version 1.1.0: Also accepts a list of column names.
index (str or object or a list of str, optional) –
Column to use to make new frame’s index. If not given, uses existing index.
Changed in version 1.1.0: Also accepts a list of index names.
values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
- Returns:
Returns reshaped DataFrame.
- Return type:
DataFrame
- Raises:
ValueError: – When there are any index, columns combinations with multiple values. Use DataFrame.pivot_table when you need to aggregate.
See also
DataFrame.pivot_tableGeneralization of pivot that can handle duplicate values for one index/column pair.
DataFrame.unstackPivot based on the index values instead of a column.
wide_to_longWide panel to long format. Less flexible but more user-friendly than melt.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', ... 'two'], ... 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], ... 'baz': [1, 2, 3, 4, 5, 6], ... 'zoo': ['x', 'y', 'z', 'q', 'w', 't']}) >>> df foo bar baz zoo 0 one A 1 x 1 one B 2 y 2 one C 3 z 3 two A 4 q 4 two B 5 w 5 two C 6 t
>>> df.pivot(index='foo', columns='bar', values='baz') bar A B C foo one 1 2 3 two 4 5 6
>>> df.pivot(index='foo', columns='bar')['baz'] bar A B C foo one 1 2 3 two 4 5 6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo']) baz zoo bar A B C A B C foo one 1 2 3 x y z two 4 5 6 q w t
You could also assign a list of column names or a list of index names.
>>> df = pd.DataFrame({ ... "lev1": [1, 1, 1, 2, 2, 2], ... "lev2": [1, 1, 2, 1, 1, 2], ... "lev3": [1, 2, 1, 2, 1, 2], ... "lev4": [1, 2, 3, 4, 5, 6], ... "values": [0, 1, 2, 3, 4, 5]}) >>> df lev1 lev2 lev3 lev4 values 0 1 1 1 1 0 1 1 1 2 2 1 2 1 2 1 3 2 3 2 1 2 4 3 4 2 1 1 5 4 5 2 2 2 6 5
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values") lev2 1 2 lev3 1 2 1 2 lev1 1 0.0 1.0 2.0 NaN 2 4.0 3.0 NaN 5.0
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values") lev3 1 2 lev1 lev2 1 1 0.0 1.0 2 2.0 NaN 2 1 4.0 3.0 2 NaN 5.0
A ValueError is raised if there are any duplicates.
>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'], ... "bar": ['A', 'A', 'B', 'C'], ... "baz": [1, 2, 3, 4]}) >>> df foo bar baz 0 one A 1 1 one A 2 2 two B 3 3 two C 4
Notice that the first two rows are the same for our index and columns arguments.
>>> df.pivot(index='foo', columns='bar', values='baz') Traceback (most recent call last): ... ValueError: Index contains duplicate entries, cannot reshape
- pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)[source]
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
- Parameters:
data (DataFrame) –
values (list-like or scalar, optional) – Column or columns to aggregate.
index (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
columns (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table columns. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
aggfunc (function, list of functions, dict, default 'mean') – If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions. If margins=True, aggfunc will be used to calculate the partial aggregates.
fill_value (scalar, default None) – Value to replace missing values with (in the resulting pivot table, after aggregation).
margins (bool, default False) – If margins=True, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns.
dropna (bool, default True) – Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.
margins_name (str, default 'All') – Name of the row / column that will contain the totals when margins is True.
observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
sort (bool, default True) –
Specifies if the result should be sorted.
New in version 1.3.0.
- Returns:
An Excel style pivot table.
- Return type:
DataFrame
See also
DataFrame.pivotPivot without aggregation that can handle non-numeric data.
DataFrame.meltUnpivot a DataFrame from wide to long format, optionally leaving identifiers set.
wide_to_longWide panel to long format. Less flexible but more user-friendly than melt.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", ... "bar", "bar", "bar", "bar"], ... "B": ["one", "one", "one", "two", "two", ... "one", "one", "two", "two"], ... "C": ["small", "large", "large", "small", ... "small", "large", "small", "small", ... "large"], ... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7], ... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]}) >>> df A B C D E 0 foo one small 1 2 1 foo one large 2 4 2 foo one large 2 5 3 foo two small 3 5 4 foo two small 3 6 5 bar one large 4 6 6 bar one small 5 8 7 bar two small 6 9 8 bar two large 7 9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum) >>> table C large small A B bar one 4.0 5.0 two 7.0 6.0 foo one 4.0 1.0 two NaN 6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum, fill_value=0) >>> table C large small A B bar one 4 5 two 7 6 foo one 4 1 two 0 6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': np.mean, 'E': np.mean}) >>> table D E A C bar large 5.500000 7.500000 small 5.500000 8.500000 foo large 2.000000 4.500000 small 2.333333 4.333333
We can also calculate multiple types of aggregations for any given value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': np.mean, ... 'E': [min, max, np.mean]}) >>> table D E mean max mean min A C bar large 5.500000 9 7.500000 6 small 5.500000 9 8.500000 8 foo large 2.000000 5 4.500000 4 small 2.333333 6 4.333333 2
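margins=True, described above, appends the partial All aggregates; a sketch with the same df (the exact layout is illustrative):
>>> pd.pivot_table(df, values='D', index='A', columns='C',
...                aggfunc='sum', margins=True)
C    large  small  All
A
bar     11     11   22
foo      4      7   11
All     15     18   33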
- pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')[source]
Quantile-based discretization function.
Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
- Parameters:
x (1d ndarray or Series) –
q (int or list-like of float) – Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
labels (array or False, default None) – Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.
retbins (bool, optional) – Whether to return the (bins, labels) or not. Can be useful if bins is given as a scalar.
precision (int, optional) – The precision at which to store and display the bins labels.
duplicates ({default 'raise', 'drop'}, optional) – If bin edges are not unique, raise ValueError or drop non-uniques.
- Returns:
out (Categorical or Series or array of integers if labels is False) – The return type (Categorical or Series) depends on the input: a Series of type category if input is a Series else Categorical. Bins are represented as categories when categorical data is returned.
bins (ndarray of floats) – Returned only if retbins is True.
Notes
Out of bounds values will be NA in the resulting Categorical object.
Examples
>>> pd.qcut(range(5), 4) ... [(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]] Categories (4, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] ...
>>> pd.qcut(range(5), 3, labels=["good", "medium", "bad"]) ... [good, good, medium, bad, bad] Categories (3, object): [good < medium < bad]
>>> pd.qcut(range(5), 4, labels=False) array([0, 0, 1, 2, 3])
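retbins exposes the computed quantile edges for reuse, e.g. to bin later data with pd.cut; a minimal sketch:
>>> codes, edges = pd.qcut(range(5), 4, labels=False, retbins=True)
>>> codes
array([0, 0, 1, 2, 3])
>>> # edges holds the five quantile boundaries computed above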
- pandas.read_clipboard(sep='\\s+', dtype_backend=_NoDefault.no_default, **kwargs)[source]
Read text from clipboard and pass to read_csv.
- Parameters:
sep (str, default '\s+') – A string or regex delimiter. The default of '\s+' denotes one or more whitespace characters.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
**kwargs – See read_csv for the full argument list.
- Returns:
A parsed DataFrame object.
- Return type:
DataFrame
- pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)[source]
Read a comma-separated values (csv) file into DataFrame.
Also supports optionally iterating or breaking of the file into chunks.
Additional help can be found in the online docs for IO Tools.
- Parameters:
filepath_or_buffer (str, path object or file-like object) –
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
sep (str, default ',') – Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
delimiter (str, default None) – Alias for sep.
header (int, list of int, None, default 'infer') – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0, 1, 3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names (array-like, optional) – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
index_col (int, str, sequence of int / str, or False, optional, default None) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
usecols (list-like or callable, optional) – Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
dtype (Type name or dict of column -> type, optional) –
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.
engine ({'c', 'python', 'pyarrow'}, optional) –
Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.
New in version 1.4.0: The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.
converters (dict, optional) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
true_values (list, optional) – Values to consider as True in addition to case-insensitive variants of “True”.
false_values (list, optional) – Values to consider as False in addition to case-insensitive variants of “False”.
skipinitialspace (bool, default False) – Skip spaces after delimiter.
skiprows (list-like, int or callable, optional) –
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
skipfooter (int, default 0) – Number of lines at bottom of file to skip (unsupported with engine='c').
nrows (int, optional) – Number of rows of file to read. Useful for reading pieces of large files.
na_values (scalar, str, list-like, or dict, optional) – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na (bool, default True) –
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter (bool, default True) – Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose (bool, default False) – Indicate number of NA values placed in non-numeric columns.
skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.
parse_dates (bool or list of int or names or list of lists or dict, default False) –
The behavior is as follows:
boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format (bool, default False) –
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.
keep_date_col (bool, default False) – If True and parse_dates specifies combining multiple columns then keep the original columns.
date_parser (function, optional) –
Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
Deprecated since version 2.0.0: Use date_format instead, or read in as object and then apply to_datetime() as-needed.
date_format (str or dict of column -> format, default None) – If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply to_datetime() as-needed.
New in version 2.0.0.
dayfirst (bool, default False) – DD/MM format dates, international and European format.
cache_dates (bool, default True) – If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.
iterator (bool, default False) –
Return TextFileReader object for iteration or getting chunks with get_chunk().
Changed in version 1.2: TextFileReader is a context manager.
chunksize (int, optional) – Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
Changed in version 1.2: TextFileReader is a context manager.
compression (str or dict, default 'infer') –
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
thousands (str, optional) – Thousands separator.
decimal (str, default '.') – Character to recognize as decimal point (e.g. use ‘,’ for European data).
lineterminator (str (length 1), optional) – Character to break file into lines. Only valid with C parser.
quotechar (str (length 1), optional) – The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting (int or csv.QUOTE_* instance, default 0) – Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote (bool, default True) – When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements INSIDE a field as a single quotechar element.
escapechar (str (length 1), optional) – One-character string used to escape other characters.
comment (str, optional) – Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.
encoding (str, optional, default "utf-8") –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings .
Changed in version 1.2: When encoding is None, errors="replace" is passed to open(). Otherwise, errors="strict" is passed to open(). This behavior was previously only the case for engine="python".
Changed in version 1.3.0: encoding_errors is a new argument. encoding no longer has an influence on how encoding errors are handled.
encoding_errors (str, optional, default "strict") –
How encoding errors are treated. List of possible values .
New in version 1.3.0.
dialect (str or csv.Dialect, optional) – If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect documentation for more details.
on_bad_lines ({'error', 'warn', 'skip'} or callable, default 'error') –
Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are :
’error’, raise an Exception when a bad line is encountered.
’warn’, raise a warning when a bad line is encountered and skip that line.
’skip’, skip bad lines without raising or warning when they are encountered.
New in version 1.3.0.
New in version 1.4.0:
callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python".
delim_whitespace (bool, default False) – Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.
low_memory (bool, default True) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser.)
memory_map (bool, default False) – If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
float_precision (str, optional) –
Specifies which converter the C engine should use for floating-point values. The options are None or ‘high’ for the ordinary converter, ‘legacy’ for the original lower precision pandas converter, and ‘round_trip’ for the round-trip converter.
Changed in version 1.2.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; for more examples on storage options refer here.
New in version 1.2.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
A comma-separated values (csv) file is returned as a two-dimensional data structure with labeled axes.
- Return type:
DataFrame or TextFileReader
See also
DataFrame.to_csvWrite DataFrame to a comma-separated values (csv) file.
read_tableRead general delimited file into DataFrame.
read_fwfRead a table of fixed-width formatted lines into DataFrame.
Examples
>>> pd.read_csv('data.csv')
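Because read_csv accepts any object with a read() method, the parameters above can be tried without a file; a sketch (not from the original docs) using io.StringIO:
>>> from io import StringIO
>>> data = "a;b;c\n1;2;3\n4;5;6\n"
>>> pd.read_csv(StringIO(data), sep=";", usecols=["a", "c"], dtype={"a": "Int64"})
   a  c
0  1  3
1  4  6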
- pandas.read_excel(io, sheet_name=0, *, header=0, names=None, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=_NoDefault.no_default, date_format=None, thousands=None, decimal='.', comment=None, skipfooter=0, storage_options=None, dtype_backend=_NoDefault.no_default)[source]
Read an Excel file into a pandas DataFrame.
Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets.
- Parameters:
io (str, bytes, ExcelFile, xlrd.Book, path object, or file-like object) –
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be:
file://localhost/path/to/table.xlsx.If you want to pass in a path object, pandas accepts any
os.PathLike.By file-like object, we refer to objects with a
read()method, such as a file handle (e.g. via builtinopenfunction) orStringIO.sheet_name (str, int, list, or None, default 0) –
Strings are used for sheet names. Integers are used in zero-indexed sheet positions (chart sheets do not count as a sheet position). Lists of strings/integers are used to request multiple sheets. Specify None to get all worksheets.
Available cases:
Defaults to 0: 1st sheet as a DataFrame
1: 2nd sheet as a DataFrame
"Sheet1": Load sheet with name “Sheet1”
[0, 1, "Sheet5"]: Load first, second and sheet named “Sheet5” as a dict of DataFrame
None: All worksheets.
header (int, list of int, default 0) – Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.
names (array-like, default None) – List of column names to use. If file contains no header row, then you should explicitly pass header=None.
index_col (int, list of int, default None) –
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.
Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True. To avoid forward filling the missing values use set_index after reading the data instead of index_col.
usecols (str, list-like, or callable, default None) –
If None, then parse all columns.
If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.
If list of int, then indicates list of column numbers to be parsed (0-indexed).
If list of string, then indicates list of column names to be parsed.
If callable, then evaluate each column name against it and parse the column if the callable returns True.
Returns a subset of the columns according to behavior above.
dtype (Type name or dict of column -> type, default None) – Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
engine (str, default None) –
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility:
”xlrd” supports old-style Excel files (.xls).
”openpyxl” supports newer Excel file formats.
”odf” supports OpenDocument file formats (.odf, .ods, .odt).
”pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if path_or_buffer is in xlsb format, pyxlsb will be used. New in version 1.3.0.
Otherwise openpyxl will be used. Changed in version 1.3.0.
converters (dict, default None) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
true_values (list, default None) – Values to consider as True.
false_values (list, default None) – Values to consider as False.
skiprows (list-like, int, or callable, optional) – Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
nrows (int, default None) – Number of rows to parse.
na_values (scalar, str, list-like, or dict, default None) – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na (bool, default True) –
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter (bool, default True) – Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose (bool, default False) – Indicate number of NA values placed in non-numeric columns.
parse_dates (bool, list-like, or dict, default False) –
The behavior is as follows:
bool. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. If you don’t want to parse some cells as date just change their type in Excel to “Text”. For non-standard datetime parsing, use pd.to_datetime after pd.read_excel.
Note: A fast-path exists for iso8601-formatted dates.
date_parser (function, optional) – Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
Deprecated since version 2.0.0: Use date_format instead, or read in as object and then apply to_datetime() as-needed.
date_format (str or dict of column -> format, default None) – If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply to_datetime() as-needed.
New in version 2.0.0.
thousands (str, default None) – Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel; any numeric columns will automatically be parsed, regardless of display format.
decimal (str, default '.') – Character to recognize as decimal point for parsing string columns to numeric (e.g. use ‘,’ for European data). Note that this parameter is only necessary for columns stored as TEXT in Excel; any numeric columns will automatically be parsed, regardless of display format.
New in version 1.4.0.
comment (str, default None) – Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored.
skipfooter (int, default 0) – Rows at the end to skip (0-indexed).
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
DataFrame from the passed in Excel file. See notes in sheet_name argument for more information on when a dict of DataFrames is returned.
- Return type:
Examples
The file can be read using the file name as string or an open file object:
>>> pd.read_excel('tmp.xlsx', index_col=0) Name Value 0 string1 1 1 string2 2 2 #Comment 3
>>> pd.read_excel(open('tmp.xlsx', 'rb'), ... sheet_name='Sheet3') Unnamed: 0 Name Value 0 0 string1 1 1 1 string2 2 2 2 #Comment 3
Index and header can be specified via the index_col and header arguments
>>> pd.read_excel('tmp.xlsx', index_col=None, header=None) 0 1 2 0 NaN Name Value 1 0.0 string1 1 2 1.0 string2 2 3 2.0 #Comment 3
Column types are inferred but can be explicitly specified
>>> pd.read_excel('tmp.xlsx', index_col=0, ... dtype={'Name': str, 'Value': float}) Name Value 0 string1 1.0 1 string2 2.0 2 #Comment 3.0
True, False, and NA values, and thousands separators have defaults, but can be explicitly specified, too. Supply the values you would like as strings or lists of strings!
>>> pd.read_excel('tmp.xlsx', index_col=0, ... na_values=['string1', 'string2']) Name Value 0 NaN 1 1 NaN 2 2 #Comment 3
Comment lines in the excel input file can be skipped using the comment kwarg
>>> pd.read_excel('tmp.xlsx', index_col=0, comment='#') Name Value 0 string1 1.0 1 string2 2.0 2 None NaN
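Multiple sheets can be requested at once, in which case a dict of DataFrames keyed by the requested sheet identifiers is returned. A minimal sketch, assuming the same tmp.xlsx from above also contains a sheet named 'Sheet3':
>>> sheets = pd.read_excel('tmp.xlsx', sheet_name=[0, 'Sheet3'], index_col=0)
>>> list(sheets)  # one DataFrame per requested sheet
[0, 'Sheet3']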
- pandas.read_feather(path, columns=None, use_threads=True, storage_options=None, dtype_backend=_NoDefault.no_default)[source]
Load a feather-format object from the file path.
- Parameters:
path (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.feather.
columns (sequence, default None) – If not provided, all columns are read.
use_threads (bool, default True) – Whether to parallelize reading using multiple threads.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Return type:
type of object stored in file
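A minimal round-trip sketch (the file name data.feather is hypothetical; writing feather requires the pyarrow package):
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
>>> df.to_feather('data.feather')
>>> pd.read_feather('data.feather', columns=['a'])  # read only column 'a'
   a
0  1
1  2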
- pandas.read_fwf(filepath_or_buffer, *, colspecs='infer', widths=None, infer_nrows=100, dtype_backend=_NoDefault.no_default, **kwds)[source]
Read a table of fixed-width formatted lines into DataFrame.
Also supports optionally iterating or breaking of the file into chunks.
Additional help can be found in the online docs for IO Tools.
- Parameters:
filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a text read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.
colspecs (list of tuple (int, int) or 'infer', optional) – A list of tuples giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to)). String value ‘infer’ can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data which are not being skipped via skiprows (default=’infer’).
widths (list of int, optional) – A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.
infer_nrows (int, default 100) – The number of rows to consider when letting the parser determine the colspecs.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
**kwds (optional) – Optional keyword arguments can be passed to
TextFileReader.
- Returns:
A comma-separated values (csv) file is returned as a two-dimensional data structure with labeled axes.
- Return type:
DataFrame or TextFileReader
See also
DataFrame.to_csv – Write DataFrame to a comma-separated values (csv) file.
read_csv – Read a comma-separated values (csv) file into DataFrame.
Examples
>>> pd.read_fwf('data.csv')
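When 'infer' misdetects the layout, colspecs can be given explicitly as half-open [from, to) intervals, or widths as a list of field widths. A sketch using an in-memory buffer instead of the data.csv placeholder above:
>>> from io import StringIO
>>> data = "id    name\n1     ann\n2     bob\n"
>>> pd.read_fwf(StringIO(data), colspecs=[(0, 2), (6, 10)])
   id name
0   1  ann
1   2  bob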
- pandas.read_gbq(query, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=True, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, max_results=None, progress_bar_type=None)[source]
Load data from Google BigQuery.
This function requires the pandas-gbq package.
See the How to authenticate with Google BigQuery guide for authentication instructions.
- Parameters:
query (str) – SQL-Like Query to return data values.
project_id (str, optional) – Google BigQuery Account project ID. Optional when available from the environment.
index_col (str, optional) – Name of result column to use for index in results DataFrame.
col_order (list(str), optional) – List of BigQuery column names in the desired order for results DataFrame.
reauth (bool, default False) – Force Google BigQuery to re-authenticate the user. This is useful if multiple accounts are used.
auth_local_webserver (bool, default True) –
Use the local webserver flow instead of the console flow when getting user credentials.
New in version 0.2.0 of pandas-gbq.
Changed in version 1.5.0: Default value is changed to True. Google has deprecated the auth_local_webserver = False “out of band” (copy-paste) flow.
dialect (str, default 'legacy') –
Note: The default value is changing to ‘standard’ in a future version.
SQL syntax dialect to use. Value can be one of:
'legacy' – Use BigQuery’s legacy SQL dialect. For more information see BigQuery Legacy SQL Reference.
'standard' – Use BigQuery’s standard SQL, which is compliant with the SQL 2011 standard. For more information see BigQuery Standard SQL Reference.
location (str, optional) –
Location where the query job should run. See the BigQuery locations documentation for a list of available locations. The location must match that of any datasets used in the query.
New in version 0.5.0 of pandas-gbq.
configuration (dict, optional) –
Query config parameters for job processing. For example:
configuration = {'query': {'useQueryCache': False}}
For more information see BigQuery REST API Reference.
credentials (google.auth.credentials.Credentials, optional) – Credentials for accessing Google APIs. Use this parameter to override default credentials, such as to use Compute Engine google.auth.compute_engine.Credentials or Service Account google.oauth2.service_account.Credentials directly.
New in version 0.8.0 of pandas-gbq.
use_bqstorage_api (bool, default False) –
Use the BigQuery Storage API to download query results quickly, but at an increased cost. To use this API, first enable it in the Cloud Console. You must also have the bigquery.readsessions.create permission on the project you are billing queries to.
This feature requires version 0.10.0 or later of the pandas-gbq package. It also requires the google-cloud-bigquery-storage and fastavro packages.
max_results (int, optional) –
If set, limit the maximum number of rows to fetch from the query results.
New in version 0.12.0 of pandas-gbq.
New in version 1.1.0.
progress_bar_type (Optional, str) –
If set, use the tqdm library to display a progress bar while the data downloads. Install the tqdm package to use this feature.
Possible values of progress_bar_type include:
None – No progress bar.
'tqdm' – Use the tqdm.tqdm() function to print a progress bar to sys.stderr.
'tqdm_notebook' – Use the tqdm.tqdm_notebook() function to display a progress bar as a Jupyter notebook widget.
'tqdm_gui' – Use the tqdm.tqdm_gui() function to display a progress bar as a graphical dialog box.
Note that this feature requires version 0.12.0 or later of the pandas-gbq package. And it requires the tqdm package. Slightly different than pandas-gbq, here the default is None.
- Returns:
df – DataFrame representing results of query.
- Return type:
See also
pandas_gbq.read_gbq – This function in the pandas-gbq library.
DataFrame.to_gbq – Write a DataFrame to Google BigQuery.
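A hedged usage sketch: the project ID is a placeholder, the query targets a public dataset, and the pandas-gbq package plus valid Google credentials are required:
>>> sql = """
... SELECT name, SUM(number) AS total
... FROM `bigquery-public-data.usa_names.usa_1910_2013`
... GROUP BY name
... ORDER BY total DESC
... LIMIT 5
... """
>>> df = pd.read_gbq(sql, project_id='my-project', dialect='standard')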
- pandas.read_hdf(path_or_buf, key=None, mode='r', errors='strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)[source]
Read from the store, close it if we opened it.
Retrieve pandas object stored in file, optionally based on where criteria.
Warning
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- Parameters:
path_or_buf (str, path object, pandas.HDFStore) –
Any valid string path is acceptable. Only the local file system is supported; remote URLs and file-like objects are not supported.
If you want to pass in a path object, pandas accepts any os.PathLike.
Alternatively, pandas accepts an open pandas.HDFStore object.
key (object, optional) – The group identifier in the store. Can be omitted if the HDF file contains a single pandas object.
mode ({'r', 'r+', 'a'}, default 'r') – Mode to use when opening the file. Ignored if path_or_buf is a pandas.HDFStore. Default is ‘r’.
errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.
where (list, optional) – A list of Term (or convertible) objects.
start (int, optional) – Row number to start selection.
stop (int, optional) – Row number to stop selection.
columns (list, optional) – A list of columns names to return.
iterator (bool, optional) – Return an iterator object.
chunksize (int, optional) – Number of rows to include in an iteration when using an iterator.
**kwargs – Additional keyword arguments passed to HDFStore.
- Returns:
The selected object. Return type depends on the object stored.
- Return type:
See also
DataFrame.to_hdf – Write a HDF file from a DataFrame.
HDFStore – Low-level access to HDF files.
Examples
>>> df = pd.DataFrame([[1, 1.0, 'a']], columns=['x', 'y', 'z']) >>> df.to_hdf('./store.h5', 'data') >>> reread = pd.read_hdf('./store.h5')
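Continuing the example above, where-based selection is possible. A sketch that assumes the object is written in the queryable 'table' format with data_columns enabled:
>>> df.to_hdf('./store.h5', 'data2', format='table', data_columns=True)
>>> pd.read_hdf('./store.h5', 'data2', where='x > 0')  # select rows where column x is positive
   x    y  z
0  1  1.0  a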
- pandas.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None, dtype_backend=_NoDefault.no_default)[source]
Read HTML tables into a list of DataFrame objects.
- Parameters:
io (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a string read() function. The string can represent a URL or the HTML itself. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.
match (str or compiled regular expression, optional) – The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.
flavor (str, optional) – The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.
header (int or list-like, optional) – The row (or list of rows for a MultiIndex) to use to make the columns headers.
index_col (int or list-like, optional) – The column (or list of columns) to use to create the index.
skiprows (int, list-like or slice, optional) – Number of rows to skip after parsing the column integer. 0-based. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single element sequence means ‘skip the nth row’ whereas an integer means ‘skip n rows’.
attrs (dict, optional) –
This is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly. For example,
attrs = {'id': 'table'}
is a valid attribute dictionary because the ‘id’ HTML tag attribute is a valid HTML attribute for any HTML tag as per this document.
attrs = {'asdf': 'table'}
is not a valid attribute dictionary because ‘asdf’ is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes can be found here. A working draft of the HTML 5 spec can be found here. It contains the latest information on table attributes for the modern web.
parse_dates (bool, optional) – See read_csv() for more details.
thousands (str, optional) – Separator to use to parse thousands. Defaults to ','.
encoding (str, optional) – The encoding used to decode the web page. Defaults to None. None preserves the previous encoding behavior, which depends on the underlying parser library (e.g., the parser library will try to use the encoding provided by the document).
decimal (str, default '.') – Character to recognize as decimal point (e.g. use ‘,’ for European data).
converters (dict, default None) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the cell (not column) content, and return the transformed content.
na_values (iterable, default None) – Custom NA values.
keep_default_na (bool, default True) – If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
displayed_only (bool, default True) – Whether elements with “display: none” should be parsed.
extract_links ({None, "all", "header", "body", "footer"}) –
Table elements in the specified section(s) with <a> tags will have their href extracted.
New in version 1.5.0.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
A list of DataFrames.
- Return type:
dfs
See also
read_csv – Read a comma-separated values (csv) file into DataFrame.
Notes
Before using this function you should read the gotchas about the HTML parsing libraries.
Expect to do some cleanup after you call this function. For example, you might need to manually assign column names if the column names are converted to NaN when you pass the header=0 argument. We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.
This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for “table data”. This function attempts to properly handle colspan and rowspan attributes. If the function has a <thead> argument, it is used to construct the header, otherwise the function attempts to find the header within the body (by putting rows with only <th> elements into the header).
Similar to read_csv() the header argument is applied after skiprows is applied.
This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.
Examples
See the read_html documentation in the IO section of the docs for some examples of reading in HTML tables.
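Because io may be the HTML itself, a literal table string can be parsed directly. A minimal sketch (the markup is illustrative; a parser such as lxml must be installed):
>>> html = '<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>'
>>> dfs = pd.read_html(html)  # always returns a list of DataFrames
>>> dfs[0]
   a  b
0  1  2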
- pandas.read_json(path_or_buf, *, orient=None, typ='frame', dtype=None, convert_axes=None, convert_dates=True, keep_default_dates=True, precise_float=False, date_unit=None, encoding=None, encoding_errors='strict', lines=False, chunksize=None, compression='infer', nrows=None, storage_options=None, dtype_backend=_NoDefault.no_default, engine='ujson')[source]
Convert a JSON string to pandas object.
- Parameters:
path_or_buf (a valid JSON str, path object or file-like object) –
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.json.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
orient (str, optional) – Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:
'split': dict like {index -> [index], columns -> [columns], data -> [values]}
'records': list like [{column -> value}, ... , {column -> value}]
'index': dict like {index -> {column -> value}}
'columns': dict like {column -> {index -> value}}
'values': just the values array
The allowed and default values depend on the value of the typ parameter.
when typ == 'series':
allowed orients are {'split', 'records', 'index'}
default is 'index'
The Series index must be unique for orient 'index'.
when typ == 'frame':
allowed orients are {'split', 'records', 'index', 'columns', 'values', 'table'}
default is 'columns'
The DataFrame index must be unique for orients 'index' and 'columns'.
The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.
typ ({'frame', 'series'}, default 'frame') – The type of object to recover.
dtype (bool or dict, default None) –
If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don’t infer dtypes at all, applies only to the data.
For all orient values except 'table', default is True.
convert_axes (bool, default None) – Try to convert the axes to the proper dtypes. For all orient values except 'table', default is True.
convert_dates (bool or list of str, default True) – If True then default datelike columns may be converted (depending on keep_default_dates). If False, no dates will be converted. If a list of column names, then those columns will be converted and default datelike columns may also be converted (depending on keep_default_dates).
keep_default_dates (bool, default True) –
If parsing dates (convert_dates is not False), then try to parse the default datelike columns. A column label is datelike if
it ends with '_at',
it ends with '_time',
it begins with 'timestamp',
it is 'modified', or
it is 'date'.
precise_float (bool, default False) – Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
date_unit (str, default None) – The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.
encoding (str, default is 'utf-8') – The encoding to use to decode py3 bytes.
encoding_errors (str, optional, default "strict") –
How encoding errors are treated. See the Python documentation for the list of possible values.
New in version 1.3.0.
lines (bool, default False) – Read the file as a json object per line.
chunksize (int, optional) –
Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.
Changed in version 1.2: JsonReader is a context manager.
compression (str or dict, default 'infer') –
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
nrows (int, optional) –
The number of lines from the line-delimited json file that has to be read. This can only be passed if lines=True. If this is None, all the rows will be returned.
New in version 1.1.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib.request.Requestas header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded tofsspec.open. Please seefsspecandurllibfor more details, and for more examples on storage options refer here.New in version 1.2.0.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
engine ({"ujson", "pyarrow"}, default "ujson") –
Parser engine to use. The "pyarrow" engine is only available when lines=True.
New in version 2.0.
- Returns:
The type returned depends on the value of typ.
- Return type:
See also
DataFrame.to_json – Convert a DataFrame to a JSON string.
Series.to_json – Convert a Series to a JSON string.
json_normalize – Normalize semi-structured JSON data into a flat table.
Notes
Specific to orient='table', if a DataFrame with a literal Index name of index gets written with to_json(), the subsequent read operation will incorrectly set the Index name to None. This is because index is also used by DataFrame.to_json() to denote a missing Index name, and the subsequent read_json() operation cannot distinguish between the two. The same limitation is encountered with a MultiIndex and any names beginning with 'level_'.
Examples
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']], ... index=['row 1', 'row 2'], ... columns=['col 1', 'col 2'])
Encoding/decoding a DataFrame using 'split' formatted JSON:
>>> df.to_json(orient='split') '{"columns":["col 1","col 2"],"index":["row 1","row 2"],"data":[["a","b"],["c","d"]]}' >>> pd.read_json(_, orient='split') col 1 col 2 row 1 a b row 2 c d
Encoding/decoding a DataFrame using 'index' formatted JSON:
>>> df.to_json(orient='index') '{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> pd.read_json(_, orient='index') col 1 col 2 row 1 a b row 2 c d
Encoding/decoding a DataFrame using 'records' formatted JSON. Note that index labels are not preserved with this encoding.
>>> df.to_json(orient='records') '[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]' >>> pd.read_json(_, orient='records') col 1 col 2 0 a b 1 c d
Encoding with Table Schema
>>> df.to_json(orient='table') '{"schema":{"fields":[{"name":"index","type":"string"},{"name":"col 1","type":"string"},{"name":"col 2","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":"row 1","col 1":"a","col 2":"b"},{"index":"row 2","col 1":"c","col 2":"d"}]}'
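For line-delimited JSON, lines=True can be combined with chunksize, which returns a JsonReader usable as a context manager. A sketch with illustrative data:
>>> from io import StringIO
>>> jsonl = '{"a": 1}\n{"a": 2}\n'
>>> with pd.read_json(StringIO(jsonl), lines=True, chunksize=1) as reader:
...     for chunk in reader:  # each chunk is a one-row DataFrame
...         print(len(chunk))
1
1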
- pandas.read_orc(path, columns=None, dtype_backend=_NoDefault.no_default, **kwargs)[source]
Load an ORC object from the file path, returning a DataFrame.
- Parameters:
path (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.orc.
columns (list, default None) – If not None, only these columns will be read from the file. Output always follows the ordering of the file and not the columns list. This mirrors the original behaviour of pyarrow.orc.ORCFile.read().
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
**kwargs – Any additional kwargs are passed to pyarrow.
- Return type:
Notes
Before using this function you should read the user guide about ORC and install optional dependencies.
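A round-trip sketch (the file name example.orc is hypothetical; reading and writing ORC requires pyarrow):
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
>>> df.to_orc('example.orc')
>>> pd.read_orc('example.orc', columns=['b'])  # read only column 'b'
   b
0  x
1  y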
- pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=_NoDefault.no_default, dtype_backend=_NoDefault.no_default, **kwargs)[source]
Load a parquet object from the file path, returning a DataFrame.
- Parameters:
path (str, path object or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.
engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
columns (list, default=None) – If not None, only these columns will be read from the file.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.3.0.
use_nullable_dtypes (bool, default False) –
If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame (only applicable for the pyarrow engine). As new dtypes are added that support pd.NA in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional support dtypes) may change without notice.
Deprecated since version 2.0.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
**kwargs – Any additional kwargs are passed to the engine.
- Return type:
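A round-trip sketch (the file name example.parquet is hypothetical; assumes pyarrow is installed), including the experimental pyarrow dtype_backend:
>>> df = pd.DataFrame({'a': [1, None, 3]})
>>> df.to_parquet('example.parquet')
>>> pd.read_parquet('example.parquet', dtype_backend='pyarrow').dtypes
a    double[pyarrow]
dtype: object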
- pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)[source]
Load pickled pandas object (or any object) from file.
Warning
Loading pickled data received from untrusted sources can be unsafe. See here.
- Parameters:
filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.
compression (str or dict, default 'infer') –
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
- Return type:
same type as object stored in file
See also
DataFrame.to_pickle – Pickle (serialize) DataFrame object to file.
Series.to_pickle – Pickle (serialize) Series object to file.
read_hdf – Read HDF5 file into a DataFrame.
read_sql – Read SQL query or database table into a DataFrame.
read_parquet – Load a parquet object, returning a DataFrame.
Notes
read_pickle is only guaranteed to be backwards compatible to pandas 0.20.3 provided the object was serialized with to_pickle.
Examples
>>> original_df = pd.DataFrame( ... {"foo": range(5), "bar": range(5, 10)} ... ) >>> original_df foo bar 0 0 5 1 1 6 2 2 7 3 3 8 4 4 9 >>> pd.to_pickle(original_df, "./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl") >>> unpickled_df foo bar 0 0 5 1 1 6 2 2 7 3 3 8 4 4 9
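Compression is inferred from the file extension on both ends, so a gzip round trip needs no extra arguments. A sketch reusing original_df from above:
>>> pd.to_pickle(original_df, "./dummy.pkl.gz")  # compression='infer' picks gzip
>>> pd.read_pickle("./dummy.pkl.gz").equals(original_df)
True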
- pandas.read_sas(filepath_or_buffer, *, format=None, index=None, encoding=None, chunksize=None, iterator=False, compression='infer')[source]
Read SAS files stored as either XPORT or SAS7BDAT format files.
- Parameters:
filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.sas7bdat.
format (str {'xport', 'sas7bdat'} or None) – If None, file format is inferred from file extension. If ‘xport’ or ‘sas7bdat’, uses the corresponding format.
index (identifier of index column, defaults to None) – Identifier of column that should be used as index of the DataFrame.
encoding (str, default is None) – Encoding for text data. If None, text data are stored as raw bytes.
chunksize (int) –
Read file chunksize lines at a time, returns iterator.
Changed in version 1.2: TextFileReader is a context manager.
iterator (bool, defaults to False) –
If True, returns an iterator for reading the file incrementally.
Changed in version 1.2: TextFileReader is a context manager.
compression (str or dict, default 'infer') –
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
- Returns:
DataFrame if iterator=False and chunksize=None, else SAS7BDATReader or XportReader
- Return type:
DataFrame | ReaderBase
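An incremental-reading sketch (the file name is hypothetical and process is a stand-in for per-chunk work): with chunksize set, the returned reader works as a context manager:
>>> with pd.read_sas('data.sas7bdat', chunksize=1000) as reader:
...     for chunk in reader:
...         process(chunk)  # process is a hypothetical per-chunk handler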
- pandas.read_spss(path, usecols=None, convert_categoricals=True, dtype_backend=_NoDefault.no_default)[source]
Load an SPSS file from the file path, returning a DataFrame.
- Parameters:
path (str or Path) – File path.
usecols (list-like, optional) – Return a subset of the columns. If None, return all columns.
convert_categoricals (bool, default is True) – Convert categorical columns into pd.Categorical.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Return type:
- pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default, dtype=None)[source]
Read SQL query or database table into a DataFrame.
This function is a convenience wrapper around read_sql_table and read_sql_query (for backward compatibility). It will delegate to the specific function depending on the provided input. A SQL query will be routed to read_sql_query, while a database table name will be routed to read_sql_table. Note that the delegated function might have more specific notes about their functionality not listed here.
- Parameters:
sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.
con (SQLAlchemy connectable, str, or sqlite3 connection) –
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable; str connections are closed automatically. See here.
index_col (str or list of str, optional, default: None) – Column(s) to set as index(MultiIndex).
coerce_float (bool, default True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
params (list, tuple or dict, optional, default: None) – List of parameters to pass to execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249’s paramstyle, is supported. E.g. for psycopg2, uses %(name)s so use params={'name': 'value'}.
parse_dates (list or dict, default: None) –
List of column names to parse as dates.
Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.
Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native Datetime support, such as SQLite.
columns (list, default: None) – List of column names to select from SQL table (only used when reading a table).
chunksize (int, default None) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
dtype (Type name or dict of columns) –
Data type for data or columns. E.g. np.float64 or {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. The argument is ignored if a table is passed instead of a query.
New in version 2.0.0.
- Return type:
See also
read_sql_table – Read SQL database table into a DataFrame.
read_sql_query – Read SQL query into a DataFrame.
Examples
Read data from SQL via either a SQL query or a SQL tablename. When using a SQLite database only SQL queries are accepted, providing only the SQL tablename will result in an error.
>>> from sqlite3 import connect >>> conn = connect(':memory:') >>> df = pd.DataFrame(data=[[0, '10/11/12'], [1, '12/11/10']], ... columns=['int_column', 'date_column']) >>> df.to_sql('test_data', conn) 2
>>> pd.read_sql('SELECT int_column, date_column FROM test_data', conn) int_column date_column 0 0 10/11/12 1 1 12/11/10
>>> pd.read_sql('test_data', 'postgres:///db_name')
Apply date parsing to columns through the parse_dates argument. The parse_dates argument calls pd.to_datetime on the provided columns. Custom argument values for applying pd.to_datetime on a column are specified via a dictionary format:
>>> pd.read_sql('SELECT int_column, date_column FROM test_data', ... conn, ... parse_dates={"date_column": {"format": "%d/%m/%y"}}) int_column date_column 0 0 2012-11-10 1 1 2010-11-12
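Reusing conn from the examples above, setting chunksize makes read_sql return an iterator of DataFrames rather than a single frame, a sketch of which is:
>>> for chunk in pd.read_sql('SELECT * FROM test_data', conn, chunksize=1):
...     print(len(chunk))  # one row per chunk
1
1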
- pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None, dtype_backend=_NoDefault.no_default)[source]
Read SQL query into a DataFrame.
Returns a DataFrame corresponding to the result set of the query string. Optionally provide an index_col parameter to use one of the columns as the index, otherwise default integer index will be used.
- Parameters:
sql (str SQL query or SQLAlchemy Selectable (select or text object)) – SQL query to be executed.
con (SQLAlchemy connectable, str, or sqlite3 connection) – Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
index_col (str or list of str, optional, default: None) – Column(s) to set as index(MultiIndex).
coerce_float (bool, default True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Useful for SQL result sets.
params (list, tuple or dict, optional, default: None) – List of parameters to pass to execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249’s paramstyle, is supported. E.g. for psycopg2, uses %(name)s so use params={'name': 'value'}.
parse_dates (list or dict, default: None) –
List of column names to parse as dates.
Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.
Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native Datetime support, such as SQLite.
chunksize (int, default None) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.
dtype (Type name or dict of columns) –
Data type for data or columns. E.g. np.float64 or {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}.
New in version 1.3.0.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Return type:
See also
read_sql_table – Read SQL database table into a DataFrame.
read_sql – Read SQL query or database table into a DataFrame.
Notes
Any datetime values with time zone information parsed via the parse_dates parameter will be converted to UTC.
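A parameterized-query sketch against an in-memory SQLite database (sqlite3 uses the qmark paramstyle, so placeholders are ? and params is a tuple):
>>> from sqlite3 import connect
>>> conn = connect(':memory:')
>>> pd.DataFrame({'a': [1, 2, 3]}).to_sql('t', conn)
3
>>> pd.read_sql_query('SELECT a FROM t WHERE a > ?', conn, params=(1,))
   a
0  2
1  3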
- pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default)[source]
Read SQL database table into a DataFrame.
Given a table name and a SQLAlchemy connectable, returns a DataFrame. This function does not support DBAPI connections.
- Parameters:
table_name (str) – Name of SQL table in database.
con (SQLAlchemy connectable or str) – A database URI could be provided as str. SQLite DBAPI connection mode not supported.
schema (str, default None) – Name of SQL schema in database to query (if database flavor supports this). Uses default schema if None (default).
index_col (str or list of str, optional, default: None) – Column(s) to set as index(MultiIndex).
coerce_float (bool, default True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Can result in loss of precision.
parse_dates (list or dict, default None) –
List of column names to parse as dates.
Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.
Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native Datetime support, such as SQLite.
columns (list, default None) – List of column names to select from SQL table.
chunksize (int, default None) – If specified, returns an iterator where chunksize is the number of rows to include in each chunk.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow-backed dtypes are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
A SQL table is returned as a two-dimensional data structure with labeled axes.
- Return type:
See also
read_sql_query – Read SQL query into a DataFrame.
read_sql – Read SQL query or database table into a DataFrame.
Notes
Any datetime values with time zone information will be converted to UTC.
Examples
>>> pd.read_sql_table('table_name', 'postgres:///db_name')
- pandas.read_stata(filepath_or_buffer, *, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)[source]
Read Stata file into DataFrame.
- Parameters:
filepath_or_buffer (str, path object or file-like object) –
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be:
file://localhost/path/to/table.dta.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
convert_dates (bool, default True) – Convert date variables to DataFrame time values.
convert_categoricals (bool, default True) – Read value labels and convert columns to Categorical/Factor variables.
index_col (str, optional) – Column to set as index.
convert_missing (bool, default False) – Flag indicating whether to convert missing values to their Stata representations. If False, missing values are replaced with nan. If True, columns containing missing values are returned with object data types and missing values are represented by StataMissingValue objects.
preserve_dtypes (bool, default True) – Preserve Stata datatypes. If False, numeric data are upcast to pandas default types for foreign data (float64 or int64).
columns (list or None) – Columns to retain. Columns will be returned in the given order. None returns all columns.
order_categoricals (bool, default True) – Flag indicating whether converted categorical data are ordered.
chunksize (int, default None) – Return StataReader object for iterations, returns chunks with given number of lines.
iterator (bool, default False) – Return StataReader object.
compression (str or dict, default 'infer') –
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
- Return type:
DataFrame or StataReader
See also
io.stata.StataReader – Low-level reader for Stata data files.
DataFrame.to_stata – Export Stata data files.
Notes
Categorical variables read through an iterator may not have the same categories and dtype. This occurs when a variable stored in a DTA file is associated to an incomplete set of value labels that only label a strict subset of the values.
Examples
Creating a dummy Stata file for this example:
>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon', 'parrot'], ... 'speed': [350, 18, 361, 15]}) >>> df.to_stata('animals.dta')
Read a Stata dta file:
>>> df = pd.read_stata('animals.dta')
Read a Stata dta file in 10,000 line chunks:
>>> values = np.random.randint(0, 10, size=(20_000, 1), dtype="uint8") >>> df = pd.DataFrame(values, columns=["i"]) >>> df.to_stata('filename.dta')
>>> with pd.read_stata('filename.dta', chunksize=10000) as itr: ... for chunk in itr: ... # Operate on a single chunk, e.g., chunk.mean() ... pass
- pandas.read_table(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)[source]
Read general delimited file into DataFrame.
Also supports optionally iterating or breaking of the file into chunks.
Additional help can be found in the online docs for IO Tools.
- Parameters:
filepath_or_buffer (str, path object or file-like object) –
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
sep (str, default '\t' (tab-stop)) – Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
delimiter (str, default None) – Alias for sep.
header (int, list of int, None, default 'infer') – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names (array-like, optional) – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
index_col (int, str, sequence of int / str, or False, optional, default None) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
usecols (list-like or callable, optional) – Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
dtype (Type name or dict of column -> type, optional) –
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.
engine ({'c', 'python', 'pyarrow'}, optional) –
Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.
New in version 1.4.0: The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.
converters (dict, optional) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
true_values (list, optional) – Values to consider as True in addition to case-insensitive variants of “True”.
false_values (list, optional) – Values to consider as False in addition to case-insensitive variants of “False”.
skipinitialspace (bool, default False) – Skip spaces after delimiter.
skiprows (list-like, int or callable, optional) –
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be
lambda x: x in [0, 2].
skipfooter (int, default 0) – Number of lines at bottom of file to skip (unsupported with engine='c').
nrows (int, optional) – Number of rows of file to read. Useful for reading pieces of large files.
na_values (scalar, str, list-like, or dict, optional) – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na (bool, default True) –
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter (bool, default True) – Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose (bool, default False) – Indicate number of NA values placed in non-numeric columns.
skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.
parse_dates (bool or list of int or names or list of lists or dict, default False) –
The behavior is as follows:
boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format (bool, default False) –
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.
keep_date_col (bool, default False) – If True and parse_dates specifies combining multiple columns then keep the original columns.
date_parser (function, optional) –
Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
Deprecated since version 2.0.0: Use date_format instead, or read in as object and then apply to_datetime() as needed.
date_format (str or dict of column -> format, default None) – If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply to_datetime() as needed.
New in version 2.0.0.
dayfirst (bool, default False) – DD/MM format dates, international and European format.
cache_dates (bool, default True) – If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.
iterator (bool, default False) –
Return TextFileReader object for iteration or getting chunks with get_chunk().
Changed in version 1.2: TextFileReader is a context manager.
chunksize (int, optional) – Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
Changed in version 1.2: TextFileReader is a context manager.
compression (str or dict, default 'infer') – For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
thousands (str, optional) – Thousands separator.
decimal (str, default '.') – Character to recognize as decimal point (e.g. use ‘,’ for European data).
lineterminator (str (length 1), optional) – Character to break file into lines. Only valid with C parser.
quotechar (str (length 1), optional) – The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting (int or csv.QUOTE_* instance, default 0) – Control field quoting behavior per
csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote (bool, default True) – When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements INSIDE a field as a single quotechar element.
escapechar (str (length 1), optional) – One-character string used to escape other characters.
comment (str, optional) – Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in 'a,b,c' being treated as the header.
encoding (str, optional, default "utf-8") – Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings.
Changed in version 1.2: When encoding is None, errors="replace" is passed to open(). Otherwise, errors="strict" is passed to open(). This behavior was previously only the case for engine="python".
Changed in version 1.3.0: encoding_errors is a new argument. encoding no longer has any influence on how encoding errors are handled.
encoding_errors (str, optional, default "strict") – How encoding errors are treated. List of possible values.
New in version 1.3.0.
dialect (str or csv.Dialect, optional) – If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect documentation for more details.
on_bad_lines ({'error', 'warn', 'skip'} or callable, default 'error') –
Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:
'error', raise an Exception when a bad line is encountered.
'warn', raise a warning when a bad line is encountered and skip that line.
'skip', skip bad lines without raising or warning when they are encountered.
New in version 1.3.0.
New in version 1.4.0: callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python".
delim_whitespace (bool, default False) – Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.
low_memory (bool, default True) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser.)
memory_map (bool, default False) – If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
float_precision (str, optional) –
Specifies which converter the C engine should use for floating-point values. The options are None or 'high' for the ordinary converter, 'legacy' for the original lower precision pandas converter, and 'round_trip' for the round-trip converter.
Changed in version 1.2.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: with "numpy_nullable", nullable dtypes are used for all dtypes that have a nullable implementation; with "pyarrow", pyarrow-backed types are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
A comma-separated values (csv) file is returned as a two-dimensional data structure with labeled axes.
- Return type:
DataFrame or TextFileReader
See also
read_csv – Read a comma-separated values (csv) file into DataFrame.
read_fwf – Read a table of fixed-width formatted lines into DataFrame.
Examples
>>> pd.read_table('data.csv')
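A self-contained sketch of the chunked-reading parameters described above (the inline tab-separated data and column names are illustrative, not part of the docstring). Any object with a read() method is accepted, so io.StringIO works as input, and chunksize returns a TextFileReader that can be used as a context manager:
>>> from io import StringIO
>>> data = "a\tb\n1\t2\n3\t4\n5\t6\n"
>>> with pd.read_table(StringIO(data), chunksize=2) as reader:
...     for chunk in reader:  # each chunk is a DataFrame of up to 2 rows
...         print(chunk.shape)
(2, 2)
(1, 2)
The same pattern works with iterator=True together with reader.get_chunk(n), which reads the next n rows on demand.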
- pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, names=None, dtype=None, converters=None, parse_dates=None, encoding='utf-8', parser='lxml', stylesheet=None, iterparse=None, compression='infer', storage_options=None, dtype_backend=_NoDefault.no_default)[source]
Read XML document into a
DataFrameobject.New in version 1.3.0.
- Parameters:
path_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
xpath (str, optional, default './*') – The XPath to parse the required set of nodes for migration to DataFrame. XPath should return a collection of elements and not a single element. Note: The etree parser supports limited XPath expressions. For more complex XPath, use lxml, which requires installation.
namespaces (dict, optional) – The namespaces defined in the XML document as a dict with key being the namespace prefix and value the URI. There is no need to include all namespaces in XML, only the ones used in the xpath expression. Note: if the XML document uses a default namespace denoted as xmlns='<URI>' without a prefix, you must assign any temporary namespace prefix such as 'doc' to the URI in order to parse underlying nodes and/or attributes. For example, namespaces = {"doc": "https://example.com"}
elems_only (bool, optional, default False) – Parse only the child elements at the specified xpath. By default, all child elements and non-empty text nodes are returned.
attrs_only (bool, optional, default False) – Parse only the attributes at the specified xpath. By default, all attributes are returned.
names (list-like, optional) – Column names for DataFrame of parsed XML data. Use this parameter to rename original element names and distinguish same named elements and attributes.
dtype (Type name or dict of column -> type, optional) –
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
New in version 1.5.0.
converters (dict, optional) –
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
New in version 1.5.0.
parse_dates (bool or list of int or names or list of lists or dict, default False) –
Identifiers to parse index or columns to datetime. The behavior is as follows:
boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
New in version 1.5.0.
encoding (str, optional, default 'utf-8') – Encoding of XML document.
parser ({'lxml','etree'}, default 'lxml') – Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
stylesheet (str, path object or file-like object) – A URL, file-like object, or a raw string containing an XSLT script. This stylesheet should flatten complex, deeply nested XML documents for easier parsing. To use this feature you must have the lxml module installed and specify 'lxml' as parser. The xpath must reference nodes of the transformed XML document generated after the XSLT transformation and not the original XML document. Only XSLT 1.0 scripts, and not later versions, are currently supported.
iterparse (dict, optional) – The nodes or attributes to retrieve in iterparsing of the XML document, as a dict with key being the name of the repeating element and value being a list of elements or attribute names that are descendants of the repeated element. Note: If this option is used, it will replace xpath parsing and, unlike xpath, descendants do not need to relate to each other but can exist anywhere in the document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example, iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
New in version 1.5.0.
compression (str or dict, default 'infer') –
For on-the-fly decompression of on-disk data. If 'infer' and 'path_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) – Which dtype_backend to use: with "numpy_nullable", nullable dtypes are used for all dtypes that have a nullable implementation; with "pyarrow", pyarrow-backed types are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
A DataFrame.
- Return type:
df
See also
read_json – Convert a JSON string to pandas object.
read_html – Read HTML tables into a list of DataFrame objects.
Notes
This method is best designed to import shallow XML documents in the following format, which is the ideal fit for the two dimensions of a DataFrame (row by column).
<root>
    <row>
       <column1>data</column1>
       <column2>data</column2>
       <column3>data</column3>
       ...
    </row>
    <row>
       ...
    </row>
    ...
</root>
As a file format, XML documents can be designed in any way, including the layout of elements and attributes, as long as they conform to W3C specifications. Therefore, this method is a convenience handler for a specific flatter design and not all possible XML structures.
However, for more complex XML documents, stylesheet allows you to temporarily redesign the original document with XSLT (a special purpose language) into a flatter version for migration to a DataFrame.
This function will always return a single DataFrame or raise exceptions due to issues with the XML document, xpath, or other parameters.
See the read_xml documentation in the IO section of the docs for more information on using this method to parse XML files to DataFrames.
Examples
>>> xml = '''<?xml version='1.0' encoding='utf-8'?> ... <data xmlns="http://example.com"> ... <row> ... <shape>square</shape> ... <degrees>360</degrees> ... <sides>4.0</sides> ... </row> ... <row> ... <shape>circle</shape> ... <degrees>360</degrees> ... <sides/> ... </row> ... <row> ... <shape>triangle</shape> ... <degrees>180</degrees> ... <sides>3.0</sides> ... </row> ... </data>'''
>>> df = pd.read_xml(xml) >>> df shape degrees sides 0 square 360 4.0 1 circle 360 NaN 2 triangle 180 3.0
>>> xml = '''<?xml version='1.0' encoding='utf-8'?> ... <data> ... <row shape="square" degrees="360" sides="4.0"/> ... <row shape="circle" degrees="360"/> ... <row shape="triangle" degrees="180" sides="3.0"/> ... </data>'''
>>> df = pd.read_xml(xml, xpath=".//row") >>> df shape degrees sides 0 square 360 4.0 1 circle 360 NaN 2 triangle 180 3.0
>>> xml = '''<?xml version='1.0' encoding='utf-8'?> ... <doc:data xmlns:doc="https://example.com"> ... <doc:row> ... <doc:shape>square</doc:shape> ... <doc:degrees>360</doc:degrees> ... <doc:sides>4.0</doc:sides> ... </doc:row> ... <doc:row> ... <doc:shape>circle</doc:shape> ... <doc:degrees>360</doc:degrees> ... <doc:sides/> ... </doc:row> ... <doc:row> ... <doc:shape>triangle</doc:shape> ... <doc:degrees>180</doc:degrees> ... <doc:sides>3.0</doc:sides> ... </doc:row> ... </doc:data>'''
>>> df = pd.read_xml(xml, ... xpath="//doc:row", ... namespaces={"doc": "https://example.com"}) >>> df shape degrees sides 0 square 360 4.0 1 circle 360 NaN 2 triangle 180 3.0
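The iterparse option is designed for very large documents stored on local disk, so this sketch (the temporary-file handling and element names are illustrative, and it assumes the default lxml parser is installed) first writes a small attribute-centric document to a file:
>>> import os, tempfile
>>> xml = '''<?xml version='1.0' encoding='utf-8'?>
... <data>
...   <row shape="square" degrees="360" sides="4.0"/>
...   <row shape="circle" degrees="360"/>
... </data>'''
>>> with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
...     _ = f.write(xml)  # write the document to disk for iterparsing
...
>>> pd.read_xml(f.name, iterparse={"row": ["shape", "degrees", "sides"]})
    shape  degrees  sides
0  square      360    4.0
1  circle      360    NaN
>>> os.remove(f.name)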
- pandas.set_eng_float_format(accuracy=3, use_eng_prefix=False)[source]
Format float representation in DataFrame with SI notation.
- Parameters:
accuracy (int, default 3) – Number of decimal digits after the floating point.
use_eng_prefix (bool, default False) – Whether to represent a value with SI prefixes.
- Return type:
None
Examples
>>> df = pd.DataFrame([1e-9, 1e-3, 1, 1e3, 1e6]) >>> df 0 0 1.000000e-09 1 1.000000e-03 2 1.000000e+00 3 1.000000e+03 4 1.000000e+06
>>> pd.set_eng_float_format(accuracy=1) >>> df 0 0 1.0E-09 1 1.0E-03 2 1.0E+00 3 1.0E+03 4 1.0E+06
>>> pd.set_eng_float_format(use_eng_prefix=True) >>> df 0 0 1.000n 1 1.000m 2 1.000 3 1.000k 4 1.000M
>>> pd.set_eng_float_format(accuracy=1, use_eng_prefix=True) >>> df 0 0 1.0n 1 1.0m 2 1.0 3 1.0k 4 1.0M
>>> pd.set_option("display.float_format", None) # unset option
- pandas.show_versions(as_json=False)[source]
Provide useful information, important for bug reports.
It comprises info about the hosting operating system, the pandas version, and the versions of other installed related packages.
- pandas.test(extra_args=None)[source]
Run the pandas test suite using pytest.
By default, runs with the marks --skip-slow, --skip-network, --skip-db.
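A sketch of a typical invocation (running the suite requires pytest and hypothesis to be installed; the arguments shown are illustrative and replace the defaults):
>>> pd.test(extra_args=["-m", "not slow"])  # forward custom options to pytest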
- pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, *, unit=None)[source]
Return a fixed frequency TimedeltaIndex, with day as the default frequency.
- Parameters:
start (str or timedelta-like, default None) – Left bound for generating timedeltas.
end (str or timedelta-like, default None) – Right bound for generating timedeltas.
periods (int, default None) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.
name (str, default None) – Name of the resulting TimedeltaIndex.
closed (str, default None) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None).
unit (str, default None) –
Specify the desired resolution of the result.
New in version 2.0.0.
- Return type:
TimedeltaIndex
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
Examples
>>> pd.timedelta_range(start='1 day', periods=4) TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')
The closed parameter specifies which endpoint is included. The default behavior is to include both endpoints.
>>> pd.timedelta_range(start='1 day', periods=4, closed='right') TimedeltaIndex(['2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')
The freq parameter specifies the frequency of the TimedeltaIndex. Only fixed frequencies can be passed; non-fixed frequencies such as 'M' (month end) will raise.
>>> pd.timedelta_range(start='1 day', end='2 days', freq='6H') TimedeltaIndex(['1 days 00:00:00', '1 days 06:00:00', '1 days 12:00:00', '1 days 18:00:00', '2 days 00:00:00'], dtype='timedelta64[ns]', freq='6H')
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
>>> pd.timedelta_range(start='1 day', end='5 days', periods=4) TimedeltaIndex(['1 days 00:00:00', '2 days 08:00:00', '3 days 16:00:00', '5 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
Specify a unit
>>> pd.timedelta_range("1 Day", periods=3, freq="100000D", unit="s") TimedeltaIndex(['1 days 00:00:00', '100001 days 00:00:00', '200001 days 00:00:00'], dtype='timedelta64[s]', freq='100000D')
- pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, infer_datetime_format=_NoDefault.no_default, origin='unix', cache=True)[source]
Convert argument to datetime.
This function converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object.
- Parameters:
arg (int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like) – The object to convert to a datetime. If a DataFrame is provided, the method expects minimally the following columns: "year", "month", "day".
errors ({'ignore', 'raise', 'coerce'}, default 'raise') –
If 'raise', then invalid parsing will raise an exception.
If 'coerce', then invalid parsing will be set as NaT.
If 'ignore', then invalid parsing will return the input.
dayfirst (bool, default False) – Specify a date parse order if arg is str or is list-like. If True, parses dates with the day first, e.g. "10/11/12" is parsed as 2012-11-10.
Warning
dayfirst=True is not strict, but will prefer to parse with day first.
yearfirst (bool, default False) – Specify a date parse order if arg is str or is list-like. If True, parses dates with the year first, e.g. "10/11/12" is parsed as 2010-11-12. If both dayfirst and yearfirst are True, yearfirst takes precedence (same as dateutil).
Warning
yearfirst=True is not strict, but will prefer to parse with year first.
utc (bool, default False) –
Control timezone-related parsing, localization and conversion.
If True, the function always returns a timezone-aware UTC-localized Timestamp, Series or DatetimeIndex. To do this, timezone-naive inputs are localized as UTC, while timezone-aware inputs are converted to UTC.
If False (default), inputs will not be coerced to UTC. Timezone-naive inputs will remain naive, while timezone-aware ones will keep their time offsets. Limitations exist for mixed offsets (typically, daylight savings); see the Examples section for details.
See also: pandas general documentation about timezone conversion and localization.
format (str, default None) –
The strftime to parse time, e.g. "%d/%m/%Y". See strftime documentation for more information on choices, though note that "%f" will parse all the way up to nanoseconds. You can also pass:
"ISO8601", to parse any ISO8601 time string (not necessarily in exactly the same format);
"mixed", to infer the format for each element individually. This is risky, and you should probably use it along with dayfirst.
exact (bool, default True) –
Control how format is used:
If True, require an exact format match.
If False, allow the format to match anywhere in the target string.
Cannot be used alongside format='ISO8601' or format='mixed'.
unit (str, default 'ns') – The unit of the arg (D, s, ms, us, ns) when arg is an integer or float number. This will be based off the origin. For example, with unit='ms' and origin='unix', this would calculate the number of milliseconds to the unix epoch start.
infer_datetime_format (bool, default False) –
If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
Deprecated since version 2.0.0: A strict version of this argument is now the default; passing it has no effect.
origin (scalar, default 'unix') –
Define the reference date. The numeric values would be parsed as number of units (defined by unit) since this reference date.
If 'unix' (or POSIX) time; origin is set to 1970-01-01.
If 'julian', unit must be 'D', and origin is set to the beginning of the Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC.
If Timestamp convertible (Timestamp, dt.datetime, np.datetime64 or date string), origin is set to the Timestamp identified by origin.
If a float or integer, origin is the millisecond difference relative to 1970-01-01.
cache (bool, default True) – If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets. The cache is only used when there are at least 50 values. The presence of out-of-bounds values will render the cache unusable and may slow down parsing.
- Returns:
If parsing succeeded. Return type depends on input (types in parentheses correspond to fallback in case of unsuccessful timezone or out-of-range timestamp parsing):
scalar: Timestamp (or datetime.datetime)
array-like: DatetimeIndex (or Series with object dtype containing datetime.datetime)
Series: Series of datetime64 dtype (or Series of object dtype containing datetime.datetime)
DataFrame: Series of datetime64 dtype (or Series of object dtype containing datetime.datetime)
- Return type:
datetime
- Raises:
ParserError – When parsing a date from string fails.
ValueError – When another datetime conversion error happens. For example when one of 'year', 'month', 'day' columns is missing in a DataFrame, or when a timezone-aware datetime.datetime is found in an array-like of mixed time offsets, and utc=False.
See also
DataFrame.astype – Cast argument to a specified dtype.
to_timedelta – Convert argument to timedelta.
convert_dtypes – Convert dtypes.
Notes
Many input types are supported, and lead to different output types:
scalars can be int, float, str, datetime object (from stdlib datetime module or numpy). They are converted to Timestamp when possible, otherwise they are converted to datetime.datetime. None/NaN/null scalars are converted to NaT.
array-like can contain int, float, str, datetime objects. They are converted to DatetimeIndex when possible, otherwise they are converted to Index with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.
Series are converted to Series with datetime64 dtype when possible, otherwise they are converted to Series with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.
DataFrame/dict-like are converted to Series with datetime64 dtype. For each row a datetime is created from assembling the various dataframe columns. Column keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
The following causes are responsible for datetime.datetime objects being returned (possibly inside an Index or a Series with object dtype) instead of a proper pandas designated type (Timestamp, DatetimeIndex or Series with datetime64 dtype):
when any input element is before Timestamp.min or after Timestamp.max, see timestamp limitations.
when utc=False (default) and the input is an array-like or Series containing mixed naive/aware datetime, or aware with mixed time offsets. Note that this happens in the (quite frequent) situation when the timezone has a daylight savings policy. In that case you may wish to use utc=True.
Examples
Handling various input formats
Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
>>> df = pd.DataFrame({'year': [2015, 2016], ... 'month': [2, 3], ... 'day': [4, 5]}) >>> pd.to_datetime(df) 0 2015-02-04 1 2016-03-05 dtype: datetime64[ns]
Using a unix epoch time
>>> pd.to_datetime(1490195805, unit='s') Timestamp('2017-03-22 15:16:45') >>> pd.to_datetime(1490195805433502912, unit='ns') Timestamp('2017-03-22 15:16:45.433502912')
Warning
For float arg, precision rounding might happen. To prevent unexpected behavior use a fixed-width exact type.
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D', ... origin=pd.Timestamp('1960-01-01')) DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
Differences with strptime behavior
"%f"will parse all the way up to nanoseconds.>>> pd.to_datetime('2018-10-26 12:00:00.0000000011', ... format='%Y-%m-%d %H:%M:%S.%f') Timestamp('2018-10-26 12:00:00.000000001')
Non-convertible date/times
If a date does not meet the timestamp limitations, passing errors='ignore' will return the original input instead of raising any exception.
Passing errors='coerce' will force an out-of-bounds date to NaT, in addition to forcing non-dates (or non-parseable dates) to NaT.
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore') '13000101' >>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce') NaT
Timezones and time offsets
The default behaviour (utc=False) is as follows:
Timezone-naive inputs are converted to timezone-naive DatetimeIndex:
>>> pd.to_datetime(['2018-10-26 12:00:00', '2018-10-26 13:00:15']) DatetimeIndex(['2018-10-26 12:00:00', '2018-10-26 13:00:15'], dtype='datetime64[ns]', freq=None)
Timezone-aware inputs with constant time offset are converted to timezone-aware DatetimeIndex:
>>> pd.to_datetime(['2018-10-26 12:00 -0500', '2018-10-26 13:00 -0500']) DatetimeIndex(['2018-10-26 12:00:00-05:00', '2018-10-26 13:00:00-05:00'], dtype='datetime64[ns, UTC-05:00]', freq=None)
However, timezone-aware inputs with mixed time offsets (for example issued from a timezone with daylight savings, such as Europe/Paris) are not successfully converted to a DatetimeIndex. Instead a simple Index containing datetime.datetime objects is returned:
>>> pd.to_datetime(['2020-10-25 02:00 +0200', '2020-10-25 04:00 +0100']) Index([2020-10-25 02:00:00+02:00, 2020-10-25 04:00:00+01:00], dtype='object')
A mix of timezone-aware and timezone-naive inputs is also converted to a simple Index containing datetime.datetime objects:
>>> from datetime import datetime >>> pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)]) Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')
Setting utc=True solves most of the above issues:
Timezone-naive inputs are localized as UTC:
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00'], utc=True) DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 13:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
Timezone-aware inputs are converted to UTC (the output represents the exact same datetime, but viewed from the UTC time offset +00:00).
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'], ... utc=True) DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
Inputs can contain both strings and datetimes; the above rules still apply:
>>> pd.to_datetime(['2018-10-26 12:00', datetime(2020, 1, 1, 18)], utc=True) DatetimeIndex(['2018-10-26 12:00:00+00:00', '2020-01-01 18:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
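A brief sketch of the format="ISO8601" option described above (not part of the original example set): ISO 8601 strings of varying completeness can be parsed together without spelling out a single strftime pattern:
>>> pd.to_datetime(['2020-01-01', '2020-01-01 03:00'], format='ISO8601')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 03:00:00'], dtype='datetime64[ns]', freq=None)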
- pandas.to_numeric(arg, errors='raise', downcast=None, dtype_backend=_NoDefault.no_default)[source]
Convert argument to a numeric type.
The default return dtype is float64 or int64 depending on the data supplied. Use the downcast parameter to obtain other dtypes.
Please note that precision loss may occur if really large numbers are passed in. Due to the internal limitations of ndarray, if numbers smaller than -9223372036854775808 (np.iinfo(np.int64).min) or larger than 18446744073709551615 (np.iinfo(np.uint64).max) are passed in, it is very likely they will be converted to float so that they can be stored in an ndarray. These warnings apply similarly to Series since it internally leverages ndarray.
- Parameters:
arg (scalar, list, tuple, 1-d array, or Series) – Argument to be converted.
errors ({'ignore', 'raise', 'coerce'}, default 'raise') –
If ‘raise’, then invalid parsing will raise an exception.
If ‘coerce’, then invalid parsing will be set as NaN.
If ‘ignore’, then invalid parsing will return the input.
downcast (str, default None) –
Can be ‘integer’, ‘signed’, ‘unsigned’, or ‘float’. If not None, and if the data has been successfully cast to a numerical dtype (or if the data was numeric to begin with), downcast that resulting data to the smallest numerical dtype possible according to the following rules:
'integer' or 'signed': smallest signed int dtype (min.: np.int8)
'unsigned': smallest unsigned int dtype (min.: np.uint8)
'float': smallest float dtype (min.: np.float32)
As this behaviour is separate from the core conversion to numeric values, any errors raised during the downcasting will be surfaced regardless of the value of the ‘errors’ input.
In addition, downcasting will only occur if the size of the resulting data’s dtype is strictly larger than the dtype it is to be cast to, so if none of the dtypes checked satisfy that specification, no downcasting will be performed on the data.
dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –
Which dtype_backend to use: with "numpy_nullable", nullable dtypes are used for all dtypes that have a nullable implementation; with "pyarrow", pyarrow-backed types are used for all dtypes.
The dtype_backends are still experimental.
New in version 2.0.
- Returns:
Numeric if parsing succeeded. Return type depends on input. Series if Series, otherwise ndarray.
- Return type:
ret
See also
DataFrame.astype – Cast argument to a specified dtype.
to_datetime – Convert argument to datetime.
to_timedelta – Convert argument to timedelta.
numpy.ndarray.astype – Cast a numpy array to a specified type.
DataFrame.convert_dtypes – Convert dtypes.
Examples
Take separate series and convert to numeric, coercing when told to
>>> s = pd.Series(['1.0', '2', -3]) >>> pd.to_numeric(s) 0 1.0 1 2.0 2 -3.0 dtype: float64 >>> pd.to_numeric(s, downcast='float') 0 1.0 1 2.0 2 -3.0 dtype: float32 >>> pd.to_numeric(s, downcast='signed') 0 1 1 2 2 -3 dtype: int8 >>> s = pd.Series(['apple', '1.0', '2', -3]) >>> pd.to_numeric(s, errors='ignore') 0 apple 1 1.0 2 2 3 -3 dtype: object >>> pd.to_numeric(s, errors='coerce') 0 NaN 1 1.0 2 2.0 3 -3.0 dtype: float64
Downcasting of nullable integer and floating dtypes is supported:
>>> s = pd.Series([1, 2, 3], dtype="Int64") >>> pd.to_numeric(s, downcast="integer") 0 1 1 2 2 3 dtype: Int8 >>> s = pd.Series([1.0, 2.1, 3.0], dtype="Float64") >>> pd.to_numeric(s, downcast="float") 0 1.0 1 2.1 2 3.0 dtype: Float32
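A short sketch of the dtype_backend parameter described above: with "numpy_nullable", missing entries are kept as <NA> in a nullable integer result instead of forcing the column to float:
>>> pd.to_numeric(pd.Series(['1', '2', None]), dtype_backend='numpy_nullable')
0       1
1       2
2    <NA>
dtype: Int64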
- pandas.to_pickle(obj, filepath_or_buffer, compression='infer', protocol=5, storage_options=None)[source]
Pickle (serialize) object to file.
- Parameters:
obj (any object) – Any python object.
filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. Also accepts a URL; the URL has to be of S3 or GCS.
compression (str or dict, default 'infer') – For on-the-fly compression of the output data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
protocol (int) – Int which indicates which protocol should be used by the pickler, default HIGHEST_PROTOCOL (see [1], paragraph 12.1.2). The possible values for this parameter depend on the version of Python. For Python 2.x, possible values are 0, 1, 2. For Python >= 3.0, 3 is a valid value. For Python >= 3.4, 4 is a valid value. A negative value for the protocol parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
New in version 1.2.0.
- Return type:
None
See also
read_pickle – Load pickled pandas object (or any object) from file.
DataFrame.to_hdf – Write DataFrame to an HDF5 file.
DataFrame.to_sql – Write DataFrame to a SQL database.
DataFrame.to_parquet – Write a DataFrame to the binary parquet format.
Examples
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)}) >>> original_df foo bar 0 0 5 1 1 6 2 2 7 3 3 8 4 4 9 >>> pd.to_pickle(original_df, "./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl") >>> unpickled_df foo bar 0 0 5 1 1 6 2 2 7 3 3 8 4 4 9
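A sketch of the compression behaviour described above, reusing original_df from the previous example (the .gz file name is illustrative): gzip compression is inferred from the extension on write, and read_pickle infers it the same way on read:
>>> pd.to_pickle(original_df, "./dummy.pkl.gz")  # 'infer' detects gzip from '.gz'
>>> pd.read_pickle("./dummy.pkl.gz").equals(original_df)
True
>>> import os
>>> os.remove("./dummy.pkl.gz")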
- pandas.to_timedelta(arg, unit=None, errors='raise')[source]
Convert argument to timedelta.
Timedeltas are absolute differences in times, expressed in different units (e.g. days, hours, minutes, seconds). This method converts an argument from a recognized timedelta format / value into a Timedelta type.
- Parameters:
arg (str, timedelta, list-like or Series) –
The data to be converted to timedelta.
Changed in version 2.0: Strings with units ‘M’, ‘Y’ and ‘y’ do not represent unambiguous timedelta values and will raise an exception.
unit (str, optional) –
Denotes the unit of the arg for numeric arg. Defaults to "ns".
Possible values:
'W'
'D' / 'days' / 'day'
'hours' / 'hour' / 'hr' / 'h'
'm' / 'minute' / 'min' / 'minutes' / 'T'
'S' / 'seconds' / 'sec' / 'second'
'ms' / 'milliseconds' / 'millisecond' / 'milli' / 'millis' / 'L'
'us' / 'microseconds' / 'microsecond' / 'micro' / 'micros' / 'U'
'ns' / 'nanoseconds' / 'nano' / 'nanos' / 'nanosecond' / 'N'
Changed in version 1.1.0: Must not be specified when arg contains strings and errors="raise".
errors ({'ignore', 'raise', 'coerce'}, default 'raise') –
If ‘raise’, then invalid parsing will raise an exception.
If ‘coerce’, then invalid parsing will be set as NaT.
If ‘ignore’, then invalid parsing will return the input.
- Returns:
If parsing succeeded. Return type depends on input:
list-like: TimedeltaIndex of timedelta64 dtype
Series: Series of timedelta64 dtype
scalar: Timedelta
- Return type:
timedelta
See also
DataFrame.astype – Cast argument to a specified dtype.
to_datetime – Convert argument to datetime.
convert_dtypes – Convert dtypes.
Notes
If the precision is higher than nanoseconds, the precision of the duration is truncated to nanoseconds for string inputs.
Examples
Parsing a single string to a Timedelta:
>>> pd.to_timedelta('1 days 06:05:01.00003') Timedelta('1 days 06:05:01.000030') >>> pd.to_timedelta('15.5us') Timedelta('0 days 00:00:00.000015500')
Parsing a list or array of strings:
>>> pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan']) TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015500', NaT], dtype='timedelta64[ns]', freq=None)
Converting numbers by specifying the unit keyword argument:
>>> pd.to_timedelta(np.arange(5), unit='s') TimedeltaIndex(['0 days 00:00:00', '0 days 00:00:01', '0 days 00:00:02', '0 days 00:00:03', '0 days 00:00:04'], dtype='timedelta64[ns]', freq=None) >>> pd.to_timedelta(np.arange(5), unit='d') TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
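A short sketch of the errors parameter described above: with errors='coerce', entries that cannot be parsed become NaT instead of raising:
>>> pd.to_timedelta(['1 days', 'foo'], errors='coerce')
TimedeltaIndex(['1 days', NaT], dtype='timedelta64[ns]', freq=None)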
- pandas.unique(values)[source]
Return unique values based on a hash table.
Uniques are returned in order of appearance. This does NOT sort.
Significantly faster than numpy.unique for long enough sequences. Includes NA values.
- Parameters:
values (1d array-like) –
- Returns:
The return can be:
Index : when the input is an Index
Categorical : when the input is a Categorical dtype
ndarray : when the input is a Series/ndarray
Return numpy.ndarray or ExtensionArray.
- Return type:
numpy.ndarray or ExtensionArray
See also
Index.unique – Return unique values from an Index.
Series.unique – Return unique values of Series object.
Examples
>>> pd.unique(pd.Series([2, 1, 3, 3])) array([2, 1, 3])
>>> pd.unique(pd.Series([2] + [1] * 5)) array([2, 1])
>>> pd.unique(pd.Series([pd.Timestamp("20160101"), pd.Timestamp("20160101")])) array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> pd.unique( ... pd.Series( ... [ ... pd.Timestamp("20160101", tz="US/Eastern"), ... pd.Timestamp("20160101", tz="US/Eastern"), ... ] ... ) ... ) <DatetimeArray> ['2016-01-01 00:00:00-05:00'] Length: 1, dtype: datetime64[ns, US/Eastern]
>>> pd.unique( ... pd.Index( ... [ ... pd.Timestamp("20160101", tz="US/Eastern"), ... pd.Timestamp("20160101", tz="US/Eastern"), ... ] ... ) ... ) DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
>>> pd.unique(list("baabc")) array(['b', 'a', 'c'], dtype=object)
An unordered Categorical will return categories in the order of appearance.
>>> pd.unique(pd.Series(pd.Categorical(list("baabc")))) ['b', 'a', 'c'] Categories (3, object): ['a', 'b', 'c']
>>> pd.unique(pd.Series(pd.Categorical(list("baabc"), categories=list("abc")))) ['b', 'a', 'c'] Categories (3, object): ['a', 'b', 'c']
An ordered Categorical preserves the category ordering.
>>> pd.unique( ... pd.Series( ... pd.Categorical(list("baabc"), categories=list("abc"), ordered=True) ... ) ... ) ['b', 'a', 'c'] Categories (3, object): ['a' < 'b' < 'c']
An array of tuples
>>> pd.unique([("a", "b"), ("b", "a"), ("a", "c"), ("b", "a")]) array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)
- pandas.value_counts(values, sort=True, ascending=False, normalize=False, bins=None, dropna=True)[source]
Compute a histogram of the counts of non-null values.
- Parameters:
values (ndarray (1-d)) –
sort (bool, default True) – Sort by values
ascending (bool, default False) – Sort in ascending order
normalize (bool, default False) – If True then compute a relative histogram
bins (integer, optional) – Rather than count values, group them into half-open bins, convenience for pd.cut, only works with numeric data
dropna (bool, default True) – Don’t include counts of NaN
- Return type:
Series
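Examples
A minimal sketch of typical usage (not part of the original docstring; added since the docstring has no Examples section). Counts are sorted in descending order by default:
>>> counts = pd.value_counts(np.array([3, 1, 2, 3, 3, 1]))
>>> counts.index.tolist(), counts.tolist()
([3, 1, 2], [3, 2, 1])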
- pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+')[source]
Unpivot a DataFrame from wide to long format.
Less flexible but more user-friendly than melt.
With stubnames ['A', 'B'], this function expects to find one or more groups of columns with format A-suffix1, A-suffix2, …, B-suffix1, B-suffix2, … You specify what you want to call this suffix in the resulting long format with j (for example j='year').
Each row of these wide variables is assumed to be uniquely identified by i (which can be a single column name or a list of column names).
All remaining variables in the data frame are left intact.
- Parameters:
df (DataFrame) – The wide-format DataFrame.
stubnames (str or list-like) – The stub name(s). The wide format variables are assumed to start with the stub names.
i (str or list-like) – Column(s) to use as id variable(s).
j (str) – The name of the sub-observation variable. What you wish to name your suffix in the long format.
sep (str, default "") – A character indicating the separation of the variable names in the wide format, to be stripped from the names in the long format. For example, if your column names are A-suffix1, A-suffix2, you can strip the hyphen by specifying sep=’-’.
suffix (str, default '\d+') – A regular expression capturing the wanted suffixes. '\d+' captures numeric suffixes. Suffixes with no numbers could be specified with the negated character class '\D+'. You can also further disambiguate suffixes, for example, if your wide variables are of the form A-one, B-two, …, and you have an unrelated column A-rating, you can ignore the last one by specifying suffix='(one|two)'. When all suffixes are numeric, they are cast to int64/float64.
- Returns:
A DataFrame that contains each stub name as a variable, with new index (i, j).
- Return type:
DataFrame
See also
melt – Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
pivot – Create a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivot – Pivot without aggregation that can handle non-numeric data.
DataFrame.pivot_table – Generalization of pivot that can handle duplicate values for one index/column pair.
DataFrame.unstack – Pivot based on the index values instead of a column.
Notes
All extra variables are left untouched. This simply uses pandas.melt under the hood, but is hard-coded to “do the right thing” in a typical case.
Examples
>>> np.random.seed(123) >>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"}, ... "A1980" : {0 : "d", 1 : "e", 2 : "f"}, ... "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7}, ... "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1}, ... "X" : dict(zip(range(3), np.random.randn(3))) ... }) >>> df["id"] = df.index >>> df A1970 A1980 B1970 B1980 X id 0 a d 2.5 3.2 -1.085631 0 1 b e 1.2 1.3 0.997345 1 2 c f 0.7 0.1 0.282978 2 >>> pd.wide_to_long(df, ["A", "B"], i="id", j="year") ... X A B id year 0 1970 -1.085631 a 2.5 1 1970 0.997345 b 1.2 2 1970 0.282978 c 0.7 0 1980 -1.085631 d 3.2 1 1980 0.997345 e 1.3 2 1980 0.282978 f 0.1
With multiple id columns
>>> df = pd.DataFrame({ ... 'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3], ... 'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3], ... 'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1], ... 'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9] ... }) >>> df famid birth ht1 ht2 0 1 1 2.8 3.4 1 1 2 2.9 3.8 2 1 3 2.2 2.9 3 2 1 2.0 3.2 4 2 2 1.8 2.8 5 2 3 1.9 2.4 6 3 1 2.2 3.3 7 3 2 2.3 3.4 8 3 3 2.1 2.9 >>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age') >>> l ... ht famid birth age 1 1 1 2.8 2 3.4 2 1 2.9 2 3.8 3 1 2.2 2 2.9 2 1 1 2.0 2 3.2 2 1 1.8 2 2.8 3 1 1.9 2 2.4 3 1 1 2.2 2 3.3 2 1 2.3 2 3.4 3 1 2.1 2 2.9
Going from long back to wide just takes some creative use of unstack
>>> w = l.unstack() >>> w.columns = w.columns.map('{0[0]}{0[1]}'.format) >>> w.reset_index() famid birth ht1 ht2 0 1 1 2.8 3.4 1 1 2 2.9 3.8 2 1 3 2.2 2.9 3 2 1 2.0 3.2 4 2 2 1.8 2.8 5 2 3 1.9 2.4 6 3 1 2.2 3.3 7 3 2 2.3 3.4 8 3 3 2.1 2.9
Less wieldy column names are also handled
>>> np.random.seed(0) >>> df = pd.DataFrame({'A(weekly)-2010': np.random.rand(3), ... 'A(weekly)-2011': np.random.rand(3), ... 'B(weekly)-2010': np.random.rand(3), ... 'B(weekly)-2011': np.random.rand(3), ... 'X' : np.random.randint(3, size=3)}) >>> df['id'] = df.index >>> df A(weekly)-2010 A(weekly)-2011 B(weekly)-2010 B(weekly)-2011 X id 0 0.548814 0.544883 0.437587 0.383442 0 0 1 0.715189 0.423655 0.891773 0.791725 1 1 2 0.602763 0.645894 0.963663 0.528895 1 2
>>> pd.wide_to_long(df, ['A(weekly)', 'B(weekly)'], i='id', ... j='year', sep='-') ... X A(weekly) B(weekly) id year 0 2010 0 0.548814 0.437587 1 2010 1 0.715189 0.891773 2 2010 1 0.602763 0.963663 0 2011 0 0.544883 0.383442 1 2011 1 0.423655 0.791725 2 2011 1 0.645894 0.528895
If we have many columns, we could also use a regex to find our stubnames and pass that list on to wide_to_long
>>> stubnames = sorted( ... set([match[0] for match in df.columns.str.findall( ... r'[A-B]\(.*\)').values if match != []]) ... ) >>> list(stubnames) ['A(weekly)', 'B(weekly)']
All of the above examples have integers as suffixes. It is possible to have non-integers as suffixes.
>>> df = pd.DataFrame({ ... 'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3], ... 'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3], ... 'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1], ... 'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9] ... }) >>> df famid birth ht_one ht_two 0 1 1 2.8 3.4 1 1 2 2.9 3.8 2 1 3 2.2 2.9 3 2 1 2.0 3.2 4 2 2 1.8 2.8 5 2 3 1.9 2.4 6 3 1 2.2 3.3 7 3 2 2.3 3.4 8 3 3 2.1 2.9
>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age', ... sep='_', suffix=r'\w+') >>> l ... ht famid birth age 1 1 one 2.8 two 3.4 2 one 2.9 two 3.8 3 one 2.2 two 2.9 2 1 one 2.0 two 3.2 2 one 1.8 two 2.8 3 one 1.9 two 2.4 3 1 one 2.2 two 3.3 2 one 2.3 two 3.4 3 one 2.1 two 2.9